Abstract

The problem of ranking/ordering instances, instead of simply clas- 
sifying them, has recently gained much attention in machine learning. 
In this paper we formulate the ranking problem in a rigorous statistical 
framework. The goal is to learn a ranking rule for deciding, among two 
instances, which one is "better", with minimum ranking risk. Since the 
natural estimates of the risk are of the form of a U-statistic, results of 
the theory of U-processes are required for investigating the consistency 
of empirical risk minimizers. We establish in particular a tail inequality 
for degenerate U-processes, and apply it for showing that fast rates of 
convergence may be achieved under specific noise assumptions, just like 
in classification. Convex risk minimization methods are also studied.

1 Introduction

Motivated by various applications including problems related to document re- 
trieval or credit-risk screening, the ranking problem has received increasing 
attention both in the statistical and machine learning literature. In the rank- 
ing problem one has to compare two different observations and decide which 
one is "better". For example, in document retrieval applications, one may 
be concerned with comparing documents by degree of relevance for a particu- 
lar request, rather than simply classifying them as relevant or not. Similarly, 
credit establishments collect and manage large databases containing the socio- 
demographic and credit-history characteristics of their clients to build a ranking 
rule which aims at indicating reliability. 

In this paper we define a statistical framework for studying such ranking
problems. The ranking problem defined here is closely related to Stute's
conditional U-statistics [36, 37]. Indeed, Stute's results imply that certain
nonparametric estimates based on local U-statistics give universally consistent
ranking rules. Our approach here is different. Instead of local averages, we
consider empirical minimizers of U-statistics, more in the spirit of empirical
risk minimization popular in statistical learning theory, see, e.g., Vapnik and
Chervonenkis [40], Bartlett and Mendelson [6], Bousquet, Boucheron, Lugosi
[8], Koltchinskii [24], Massart [29] for surveys and recent developments. The
important feature of the ranking problem is that natural estimates of the ranking
risk involve U-statistics. Therefore, the methodology is based on the theory of
U-processes, and the key tools involve maximal and concentration inequalities,
symmetrization tricks, and a "contraction principle" for U-processes. For an
excellent account of the theory of U-statistics and U-processes we refer to the
monograph of de la Peña and Giné [12].

Furthermore we provide a theoretical analysis of certain nonparametric ranking
methods that are based on an empirical minimization of convex cost functionals
over convex sets of scoring functions. The methods are inspired by
boosting- and support vector machine-type algorithms for classification. The
main results of the paper prove universal consistency of properly regularized 
versions of these methods, establish a novel tail inequality for degenerate U- 
processes and, based on the latter result, show that fast rates of convergence 
may be achieved for empirical risk minimizers under suitable noise conditions. 

We point out that under certain conditions, finding a good ranking rule 
amounts to constructing a scoring function s. An important special case is 
the bipartite ranking problem in which the available instances in the data are 
labelled by binary labels (good and bad). In this case the ranking criterion is
closely related to the so-called AUC (area under the ROC curve) criterion (see
the Appendix for more details).

The rest of the paper is organized as follows. In Section 2, the basic models
and the two special cases of the ranking problem we consider are introduced.
Section 3 provides some basic uniform convergence and consistency results for 
empirical risk minimizers. Section 4 contains the main statistical results of 
the paper, establishing performance bounds for empirical risk minimization for 
ranking problems. In Section 5, we describe the noise assumptions which guar- 
antee fast rates of convergence in particular cases. In Section 6 a new expo- 
nential concentration inequality is established for U-processes which serves as 
a main tool in our analysis. In Section 7 we discuss convex risk minimization 
for ranking problems, laying down a theoretical framework for studying boost- 
ing and support vector machine-type ranking methods. In the Appendix we 
summarize some basic properties of U-statistics and highlight some connections 
of the ranking problem defined here to properties of the so-called ROC curve, 
appearing in related problems. 

2 The ranking problem 

Let $(X, Y)$ be a pair of random variables taking values in
$\mathcal{X} \times \mathbb{R}$, where $\mathcal{X}$ is a measurable space. The random
object $X$ models some observation and $Y$ its real-valued label. Let $(X', Y')$
denote a pair of random variables identically distributed with $(X, Y)$, and
independent of it. Denote

\[ Z = \frac{Y - Y'}{2} . \]

In the ranking problem one observes $X$ and $X'$ but not their labels $Y$ and $Y'$. We
think about $X$ being "better" than $X'$ if $Y > Y'$, that is, if $Z > 0$. (The factor
$1/2$ in the definition of $Z$ is not significant; it is merely here as a convenient
normalization.) The goal is to rank $X$ and $X'$ such that the probability that the
better ranked of them has a smaller label is as small as possible. Formally, a
ranking rule is a function $r : \mathcal{X} \times \mathcal{X} \to \{-1, 1\}$. If
$r(x, x') = 1$ then the rule ranks $x$ higher than $x'$. The performance of a ranking
rule is measured by the ranking risk

\[ L(r) = \mathbb{P}\{ Z \cdot r(X, X') < 0 \} , \]

that is, the probability that $r$ ranks two randomly drawn instances incorrectly.
Observe that in this formalization, the ranking problem is equivalent to a binary
classification problem in which the sign of the random variable $Z$ is to be guessed
based upon the pair of observations $(X, X')$. Now it is easy to determine the
ranking rule with minimal risk. Introduce the notation

\[ \rho_+(X, X') = \mathbb{P}\{ Z > 0 \mid X, X' \} , \qquad
   \rho_-(X, X') = \mathbb{P}\{ Z < 0 \mid X, X' \} . \]
Then we have the following simple fact:

Proposition 1 Define

\[ r^*(x, x') = 2 I_{[\rho_+(x, x') \ge \rho_-(x, x')]} - 1 \]

and denote $L^* = L(r^*) = \mathbb{E}\{\min(\rho_+(X, X'), \rho_-(X, X'))\}$. Then for
any ranking rule $r$,

\[ L^* \le L(r) . \]

PROOF. Let $r$ be any ranking rule. Observe that, by conditioning first on $(X, X')$,
one may write

\[ L(r) = \mathbb{E}\left( I_{[r(X, X') = 1]} \, \rho_-(X, X') + I_{[r(X, X') = -1]} \, \rho_+(X, X') \right) . \]

It is now easy to check that $L(r)$ is minimal for $r = r^*$. ∎

Thus, $r^*$ minimizes the ranking risk over all possible ranking rules. In the
definition of $r^*$ ties are broken in favor of $\rho_+$ but obviously if
$\rho_+(x, x') = \rho_-(x, x')$, an arbitrary value can be chosen for $r^*$ without
altering its risk.
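
As a small illustration of Proposition 1, the following sketch evaluates the conditional form of the ranking risk on a finite toy problem and checks by brute force that the rule which picks the larger of the two conditional probabilities is optimal. The three "pair cells", their probabilities and the values of the two conditional probabilities are hypothetical numbers chosen only for the example.

```python
from itertools import product

def ranking_risk(rule, weights, rho_plus, rho_minus):
    """L(r) = E( I[r=1]*rho_- + I[r=-1]*rho_+ ), with the expectation taken
    over finitely many pair cells k having probabilities weights[k]."""
    return sum(
        w * (rho_minus[k] if rule[k] == 1 else rho_plus[k])
        for k, w in enumerate(weights)
    )

def bayes_rule(rho_plus, rho_minus):
    """r* = 1 where rho_+ >= rho_- (ties broken in favor of rho_+), else -1."""
    return tuple(1 if p >= m else -1 for p, m in zip(rho_plus, rho_minus))

# Hypothetical toy values; rho_+ + rho_- may be below 1 when P{Z = 0} > 0.
weights = [0.5, 0.3, 0.2]
rho_plus = [0.7, 0.2, 0.4]
rho_minus = [0.3, 0.6, 0.4]

r_star = bayes_rule(rho_plus, rho_minus)
l_star = ranking_risk(r_star, weights, rho_plus, rho_minus)

# Brute force over all 2^3 ranking rules on the three cells.
best_brute_force = min(
    ranking_risk(rule, weights, rho_plus, rho_minus)
    for rule in product((-1, 1), repeat=len(weights))
)

# Closed form: the expectation of the smaller conditional probability.
expected_min = sum(w * min(p, m) for w, p, m in zip(weights, rho_plus, rho_minus))
```

On the last cell the two conditional probabilities are tied, so flipping the rule there leaves the risk unchanged, in line with the remark on ties.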

The purpose of this paper is to investigate the construction of ranking rules
of low risk based on training data. We assume that $n$ independent, identically
distributed copies of $(X, Y)$ are available: $D_n = (X_1, Y_1), \ldots, (X_n, Y_n)$.
Given a ranking rule $r$, one may use the training data to estimate its risk
$L(r) = \mathbb{P}\{Z \cdot r(X, X') < 0\}$. The perhaps most natural estimate is the
U-statistic

\[ L_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j} I_{[Z_{i,j} \cdot r(X_i, X_j) < 0]} ,
   \qquad Z_{i,j} = \frac{Y_i - Y_j}{2} . \]

In this paper we consider minimizers of the empirical estimate $L_n(r)$ over a
class $\mathcal{R}$ of ranking rules and study the performance of such empirically
selected ranking rules. Before discussing empirical risk minimization for ranking,
a few remarks are in order.
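
The empirical ranking risk can be computed directly from its definition as an average over ordered pairs. The sketch below assumes the usual U-statistic form (the fraction of ordered pairs $i \neq j$ for which the sign of $Y_i - Y_j$ disagrees with the rule); the helper `rule_from_score` and the toy data are hypothetical and serve only as an illustration.

```python
def empirical_ranking_risk(rule, xs, ys):
    """U-statistic L_n(r): fraction of ordered pairs i != j that `rule` misranks."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    errors = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            z_ij = (ys[i] - ys[j]) / 2.0          # Z_{i,j} = (Y_i - Y_j) / 2
            if z_ij * rule(xs[i], xs[j]) < 0:     # pair ranked the wrong way round
                errors += 1
    return errors / (n * (n - 1))

def rule_from_score(score):
    """Ranking rule r(x, x') = 2*I[score(x) > score(x')] - 1 (hypothetical helper)."""
    return lambda x, x_prime: 1 if score(x) > score(x_prime) else -1

# Toy data: the labels increase with x except for the pair (0.4, 0.35).
xs = [0.1, 0.4, 0.35, 0.8]
ys = [1.0, 2.0, 3.0, 4.0]

risk_identity = empirical_ranking_risk(rule_from_score(lambda x: x), xs, ys)
risk_reversed = empirical_ranking_risk(rule_from_score(lambda x: -x), xs, ys)
```

Only the signs of the differences $Y_i - Y_j$ enter the computation, so any strictly increasing relabelling of the $Y_i$ gives the same value. The double loop costs $O(n^2)$ evaluations of the rule, which is fine for a sketch but not for large samples.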

Remark 1 Note that the actual values of the $Y_i$'s are never used in the ranking
rules discussed in this paper. It is sufficient to know the values of the $Z_{i,j}$, or,
equivalently, the ordering of the $Y_i$'s.

Remark 2 (A more general framework.) One may consider a generalization
of the setup described above. Instead of ranking just two observations $X, X'$,
one may be interested in ranking $m$ independent observations $X^{(1)}, \ldots, X^{(m)}$.
In this case the value of a ranking function $r(X^{(1)}, \ldots, X^{(m)})$ is a permutation
$\pi$ of $\{1, \ldots, m\}$ and the goal is that $\pi$ should coincide with (or at least
resemble) the permutation $\bar{\pi}$ for which
$Y^{(\bar{\pi}(1))} > \cdots > Y^{(\bar{\pi}(m))}$. Given a loss function $\ell$ that assigns
a number in $[0, 1]$ to a pair of permutations, the ranking risk is defined as

\[ L(r) = \mathbb{E}\, \ell\left( r(X^{(1)}, \ldots, X^{(m)}), \bar{\pi} \right) . \]

In this general case, natural estimates of $L(r)$ involve $m$-th order U-statistics.
Many of the results of this paper may be extended, in a more or less
straightforward manner, to this general setup. In order to lighten the notation and
simplify the arguments, we restrict the discussion to the case described above,
that is, to the case when $m = 2$ and the loss function is
$\ell(\pi, \bar{\pi}) = I_{[\pi \neq \bar{\pi}]}$.

Remark 3 (RANKING AND SCORING.) In many interesting cases the ranking 
problem may be reduced to finding an appropriate scoring function. These are 
the cases when the joint distribution of $X$ and $Y$ is such that there exists a
function $s^* : \mathcal{X} \to \mathbb{R}$ such that

r*(x,x') = 1 if and only if s*(x) > s*(x') . 

A function s* satisfying the assumption is called an optimal scoring function. 
Obviously, any strictly increasing transformation of an optimal scoring function 
is also an optimal scoring function. Below we describe some important special 
cases when the ranking problem may be reduced to scoring. 

Example 1 (The bipartite ranking problem.) In the bipartite ranking
problem the label $Y$ is binary, it takes values in $\{-1, 1\}$. Writing
$\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$, it is easy to see that the Bayes ranking
risk equals

\[ L^* = \mathbb{E} \min\{ \eta(X)(1 - \eta(X')), \eta(X')(1 - \eta(X)) \}
       = \mathbb{E} \min\{ \eta(X), \eta(X') \} - (\mathbb{E}\eta(X))^2 \]

and also,

\[ L^* = \mathrm{Var}\left( \frac{Y + 1}{2} \right) - \frac{1}{2} \mathbb{E} |\eta(X) - \eta(X')| . \]

In particular,

\[ L^* \le \mathrm{Var}\left( \frac{Y + 1}{2} \right) \le \frac{1}{4} , \]

where the equality $L^* = \mathrm{Var}\left(\frac{Y+1}{2}\right)$ holds when $X$ and $Y$
are independent and the maximum is attained when $\eta \equiv 1/2$. Observe that the
difficulty of the bipartite ranking problem depends on the concentration properties
of the distribution of $\eta(X) = \mathbb{P}(Y = 1 \mid X)$ through the quantity
$\mathbb{E}(|\eta(X) - \eta(X')|)$ which is a classical measure of concentration, known
as Gini's mean difference. For given $p = \mathbb{E}(\eta(X))$, Gini's mean difference
ranges from a minimum value of zero, when $\eta(X) \equiv p$, to a maximum value of
$2p(1 - p)$ in the case when $\eta(X) = (Y + 1)/2$.
It is clear from the form of the Bayes ranking rule that the optimal ranking rule is
given by a scoring function $s^*$ where $s^*$ is any strictly increasing transformation
of $\eta$. Then one may restrict the search to ranking rules defined by scoring
functions $s$, that is, ranking rules of form $r(x, x') = 2 I_{[s(x) > s(x')]} - 1$.
Writing $L(s) \stackrel{\mathrm{def}}{=} L(r)$, one has

\[ L(s) - L^* = \mathbb{E}\left( |\eta(X') - \eta(X)| \, I_{[(s(X) - s(X'))(\eta(X) - \eta(X')) < 0]} \right) . \]
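
The expressions for the Bayes ranking risk in Example 1 can be checked exactly when $\eta(X)$ takes finitely many values. The sketch below is only an illustration: the function name and the discrete distributions are hypothetical, and the variance term is written as $p(1-p)$, which is the variance of $(Y+1)/2$ for a label with $\mathbb{P}\{Y = 1\} = p$.

```python
def bipartite_bayes_risk(etas, probs):
    """Three expressions for L* when eta(X) equals etas[k] with probability probs[k].

    Returns (direct, via_min, via_gini, gini), where gini is Gini's mean
    difference E|eta(X) - eta(X')| for independent copies X, X'.
    """
    pairs = [(a, pa, b, pb) for a, pa in zip(etas, probs) for b, pb in zip(etas, probs)]
    p = sum(a * pa for a, pa in zip(etas, probs))            # p = E eta(X) = P{Y = 1}
    direct = sum(pa * pb * min(a * (1 - b), b * (1 - a)) for a, pa, b, pb in pairs)
    via_min = sum(pa * pb * min(a, b) for a, pa, b, pb in pairs) - p ** 2
    gini = sum(pa * pb * abs(a - b) for a, pa, b, pb in pairs)
    via_gini = p * (1 - p) - 0.5 * gini                      # Var((Y+1)/2) = p(1-p)
    return direct, via_min, via_gini, gini

# A generic three-point distribution for eta(X).
direct, via_min, via_gini, gini = bipartite_bayes_risk([0.1, 0.5, 0.9], [0.2, 0.5, 0.3])

# Noiseless labels: eta(X) = (Y + 1) / 2 takes the values 0 and 1 only.
d0, m0, g0, gini0 = bipartite_bayes_risk([0.0, 1.0], [0.6, 0.4])
```

In the noiseless case the Bayes risk vanishes and Gini's mean difference equals $2p(1-p)$, the largest value compatible with the identity above since $L^* \ge 0$.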
We point out that the ranking risk in this case is closely related to the AUC
criterion which is a standard performance measure in the bipartite setting (see
[14] and Appendix 2). More precisely, we have:

\[ \mathrm{AUC}(s) = \mathbb{P}\{ s(X) > s(X') \mid Y = 1, Y' = -1 \} = 1 - \frac{1}{2p(1 - p)} L(s) , \]

where $p = \mathbb{P}(Y = 1)$, so that maximizing the AUC criterion boils down to
minimizing the ranking error.
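
A finite-sample analogue of this relation can be verified numerically. In the sketch below, which assumes no tied scores, the population quantity $p(1-p)$ is replaced by $n_+ n_- / (n(n-1))$, where $n_+$ and $n_-$ are the numbers of positive and negative labels; the function names and the toy sample are hypothetical.

```python
def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs with the positive scored strictly higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1 for sp in pos for sn in neg if sp > sn)
    return wins / (len(pos) * len(neg))

def empirical_ranking_risk_from_scores(scores, labels):
    """U-statistic ranking risk of the rule r(x, x') = 2*I[s(x) > s(x')] - 1."""
    n = len(scores)
    errors = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                r = 1 if scores[i] > scores[j] else -1
                # The sign of (Y_i - Y_j) is all that matters, so the factor 1/2 is dropped.
                if (labels[i] - labels[j]) * r < 0:
                    errors += 1
    return errors / (n * (n - 1))

scores = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]
labels = [1, -1, -1, 1, 1, -1]
n = len(scores)
n_pos = sum(1 for y in labels if y == 1)
n_neg = n - n_pos

auc = empirical_auc(scores, labels)
risk = empirical_ranking_risk_from_scores(scores, labels)
auc_from_risk = 1.0 - risk * n * (n - 1) / (2.0 * n_pos * n_neg)
```

Each misordered (positive, negative) pair is counted twice in the U-statistic, once for each ordering of the pair, which is where the factor 2 in the denominator comes from.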

Example 2 (A regression model.) Assume now that $Y$ is real-valued and
the joint distribution of $X$ and $Y$ is such that $Y = m(X) + \epsilon$ where
$m(x) = \mathbb{E}(Y \mid X = x)$ is the regression function, $\epsilon$ is independent
of $X$ and has a symmetric distribution around zero. Then clearly the optimal ranking
rule $r^*$ may be obtained by a scoring function $s^*$ where $s^*$ may be taken as any
strictly increasing transformation of $m$.



3 Empirical risk minimization 

Based on the empirical estimate $L_n(r)$ of the risk $L(r)$ of a ranking rule defined
above, one may consider choosing a ranking rule by minimizing the empirical
risk over a class $\mathcal{R}$ of ranking rules
$r : \mathcal{X} \times \mathcal{X} \to \{-1, 1\}$. Define the empirical
risk minimizer, over $\mathcal{R}$, by

\[ r_n = \arg\min_{r \in \mathcal{R}} L_n(r) . \]

(Ties are broken in an arbitrary way.) In a "first-order" approach, we may study
the performance $L(r_n) = \mathbb{P}\{Z \cdot r_n(X, X') < 0 \mid D_n\}$ of the empirical
risk minimizer by the standard bound (see, e.g., [13])

\[ L(r_n) - \inf_{r \in \mathcal{R}} L(r) \le 2 \sup_{r \in \mathcal{R}} |L_n(r) - L(r)| . \qquad (1) \]

This inequality points out that bounding the performance of an empirical min- 
imizer of the ranking risk boils down to investigating the properties of U- 
processes, that is, suprema of U-statistics indexed by a class of ranking rules. 
For a detailed and modern account of U-process theory we refer to the book 
of de la Peña and Giné [12]. In a first-order approach we basically reduce the
problem to the study of ordinary empirical processes. 
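
To make the selection step concrete, here is a minimal sketch of empirical risk minimization over a finite class of ranking rules. The class used (rules induced by four linear scoring directions in the plane), the helper names and the toy data are all hypothetical; the sketch simply evaluates the U-statistic risk of each candidate and returns a minimizer.

```python
def empirical_ranking_risk(rule, xs, ys):
    """U-statistic estimate of the ranking risk over ordered pairs i != j."""
    n = len(xs)
    errors = sum(
        1
        for i in range(n)
        for j in range(n)
        # The sign of (Y_i - Y_j) is all that matters, so the factor 1/2 is dropped.
        if i != j and (ys[i] - ys[j]) * rule(xs[i], xs[j]) < 0
    )
    return errors / (n * (n - 1))

def erm(rules, xs, ys):
    """Return (index, risk) of a rule with smallest empirical ranking risk."""
    risks = [empirical_ranking_risk(rule, xs, ys) for rule in rules]
    best = min(range(len(rules)), key=risks.__getitem__)   # first minimizer on ties
    return best, risks[best]

def linear_rule(w):
    """Rule induced by the linear score w . x on the plane (hypothetical class)."""
    def score(x):
        return w[0] * x[0] + w[1] * x[1]
    return lambda x, x_prime: 1 if score(x) > score(x_prime) else -1

directions = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
rules = [linear_rule(w) for w in directions]

# Toy data with labels equal to the sum of the two coordinates.
xs = [(0.0, 1.0), (1.0, 0.5), (2.0, 2.0), (3.0, 0.0)]
ys = [1.0, 1.5, 4.0, 3.0]

best_index, best_risk = erm(rules, xs, ys)
```

For a finite class the minimization is an exhaustive search; for richer classes the 0-1 ranking loss is hard to minimize directly, which is one motivation for the convex surrogates mentioned in the Introduction.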

By using the simple Lemma 14 given in the Appendix, we obtain the fol- 
lowing: 

Proposition 2 Define the Rademacher average

\[ R_n = \sup_{r \in \mathcal{R}} \frac{1}{\lfloor n/2 \rfloor}
   \left| \sum_{i=1}^{\lfloor n/2 \rfloor} \epsilon_i \,
   I_{[Z_{i, \lfloor n/2 \rfloor + i} \cdot r(X_i, X_{\lfloor n/2 \rfloor + i}) < 0]} \right| \]

where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher random variables (i.e.,
random symmetric sign variables). Then for any convex nondecreasing function $\psi$,

\[ \mathbb{E}\, \psi\left( L(r_n) - \inf_{r \in \mathcal{R}} L(r) \right) \le \mathbb{E}\, \psi(4 R_n) . \]

PROOF. The inequality follows immediately from (1), Lemma 14 (see the
Appendix), and a standard symmetrization inequality, see, e.g., Giné and Zinn
[17]. ∎

One may easily use this result to derive probabilistic performance bounds
for the empirical risk minimizer. For example, by taking $\psi(x) = e^{\lambda x}$ for
some $\lambda > 0$, and using the bounded differences inequality (see McDiarmid [31]),
we have

\[ \mathbb{E} \exp\left( \lambda \left( L(r_n) - \inf_{r \in \mathcal{R}} L(r) \right) \right)
   \le \mathbb{E} \exp(4 \lambda R_n)
   \le \exp\left( 4 \lambda \mathbb{E} R_n + \frac{4 \lambda^2}{n - 1} \right) . \]

By using Markov's inequality and choosing $\lambda$ to minimize the bound, we readily
obtain:

Corollary 3 Let $\delta > 0$. With probability at least $1 - \delta$,

\[ L(r_n) - \inf_{r \in \mathcal{R}} L(r) \le 4 \mathbb{E} R_n + 4 \sqrt{\frac{\ln(1/\delta)}{n - 1}} . \]

The expected value of the Rademacher average $R_n$ may now be bounded by
standard methods, see, e.g., Lugosi [27], Boucheron, Bousquet, and Lugosi [8].
For example, if the class $\mathcal{R}$ of indicator functions has finite VC dimension
$V$, then

\[ \mathbb{E} R_n \le c \sqrt{\frac{V}{n}} \]

for a universal constant $c$.
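
To get a feeling for the orders of magnitude, the bound can be evaluated numerically. The sketch below assumes that Corollary 3 reads $4\,\mathbb{E}R_n + 4\sqrt{\ln(1/\delta)/(n-1)}$, which is the form obtained by optimizing $\lambda$ in the exponential-moment bound above. The universal constant $c$ in the VC bound is not specified in the text, so the default `c=1.0` is a placeholder and not a value taken from the paper.

```python
import math

def corollary3_excess_risk_bound(expected_rademacher, n, delta):
    """Evaluate 4*E[R_n] + 4*sqrt(ln(1/delta) / (n - 1))."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    deviation = 4.0 * math.sqrt(math.log(1.0 / delta) / (n - 1))
    return 4.0 * expected_rademacher + deviation

def vc_rademacher_bound(vc_dim, n, c=1.0):
    """c * sqrt(V / n); c=1.0 is a placeholder for the unspecified universal constant."""
    return c * math.sqrt(vc_dim / n)

# Hypothetical setting: VC dimension 5, confidence level 95 percent.
bounds = [
    corollary3_excess_risk_bound(vc_rademacher_bound(5, n), n, 0.05)
    for n in (100, 1_000, 10_000)
]
```

Both terms decay like $n^{-1/2}$, so a tenfold increase in the sample size shrinks the bound by roughly a factor of $\sqrt{10}$; this is the "slow" rate that Section 4 improves upon.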

This result is similar to the one proved in the bipartite ranking case by 
Agarwal, Graepel, Herbrich, Har-Peled, and Roth [2] with the restriction that 
their bound holds conditionally on a label sequence. The analysis of [2] relies 
on a particular complexity measure called rank-shatter coefficient but the core 
of the argument is the same. 

The proposition above is convenient, simple, and, in a certain sense, not
improvable. However, it is well known from the theory of statistical learning and
empirical risk minimization for classification that the bound (1) is often quite
loose. In classification problems the looseness of such a "first-order" approach
is due to the fact that the variance of the estimators of the risk is ignored and
bounded uniformly by a constant. The main interest in considering U-statistics
lies precisely in the fact that they have minimal variance among
all unbiased estimators. However, the reduced-variance property of U-statistics
plays no role in the above analysis of the ranking problem. Observe that all
upper bounds obtained in this section remain true for an empirical risk minimizer
that, instead of using estimates based on U-statistics, estimates the risk
of a ranking rule by splitting the data set into two halves and estimates $L(r)$ by

\[ \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor}
   I_{[Z_{i, \lfloor n/2 \rfloor + i} \cdot r(X_i, X_{\lfloor n/2 \rfloor + i}) < 0]} . \]

Hence, in the previous study one loses the advantage of using U-statistics. In 
Section 4 it is shown that under certain, not uncommon, circumstances sig- 
nificantly smaller risk bounds are achievable. There it will have an essential 
importance to use sharp exponential bounds for U-processes, involving their 
reduced variance. 
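
The difference between the two estimates is easy to see on a small example. The following sketch, with hypothetical function names and toy data, computes the split-sample estimate, which uses only the $\lfloor n/2 \rfloor$ disjoint pairs $(i, \lfloor n/2 \rfloor + i)$, next to the U-statistic, which averages over all $n(n-1)$ ordered pairs.

```python
def split_sample_risk(rule, xs, ys):
    """Average ranking loss over the floor(n/2) disjoint pairs (i, floor(n/2) + i)."""
    half = len(xs) // 2
    if half < 1:
        raise ValueError("need at least two observations")
    errors = sum(
        1
        for i in range(half)
        if (ys[i] - ys[half + i]) * rule(xs[i], xs[half + i]) < 0
    )
    return errors / half

def u_statistic_risk(rule, xs, ys):
    """Average ranking loss over all ordered pairs i != j."""
    n = len(xs)
    errors = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and (ys[i] - ys[j]) * rule(xs[i], xs[j]) < 0
    )
    return errors / (n * (n - 1))

def rule(x, x_prime):
    """Rank by the observation itself: r(x, x') = 2*I[x > x'] - 1."""
    return 1 if x > x_prime else -1

xs = [0.1, 0.4, 0.35, 0.8]
ys = [1.0, 2.0, 3.0, 4.0]

split_estimate = split_sample_risk(rule, xs, ys)   # uses the pairs (0, 2) and (1, 3) only
u_estimate = u_statistic_risk(rule, xs, ys)        # also sees the misranked pair (1, 2)
```

Both estimates are unbiased for $L(r)$, but the split-sample version discards most of the available pairs, which is the source of its larger variance.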



4 Fast rates 

The main results of this paper show that the bounds obtained in the previous 
section may be significantly improved under certain conditions. It is well known 
(see, e.g., §5.2 in the survey [8] and the references therein) that tighter bounds 
for the excess risk in the context of binary classification may be obtained if one 
can control the variance of the excess risk by its expected value. In classification 
this can be guaranteed under certain "low-noise" conditions (see Tsybakov [39], 
Massart and Nedelec [30], Koltchinskii [24]). 

Next we examine possibilities of obtaining such improved performance bounds 
for empirical ranking risk minimization. The main message is that in the rank- 
ing problem one also may obtain significantly improved bounds under some 
conditions that are analogous to the low-noise conditions in the classification 
problem, though quite different in nature. 

Here we will greatly benefit from using U-statistics (as opposed to splitting 
the sample) as the small variance of the U-statistics used to estimate the ranking 
risk gives rise to sharper bounds. The starting point of our analysis is the 
Hoeffding decomposition of U-statistics (see Appendix 1). 

Set first

\[ q_r((x, y), (x', y')) = I_{[(y - y') \cdot r(x, x') < 0]} - I_{[(y - y') \cdot r^*(x, x') < 0]} \]

and consider the following estimate of the excess risk
$\Lambda(r) = L(r) - L^* = \mathbb{E}\, q_r((X, Y), (X', Y'))$:

\[ \Lambda_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j} q_r((X_i, Y_i), (X_j, Y_j)) , \]

which is a U-statistic of degree 2 with symmetric kernel $q_r$. Clearly, the
minimizer $r_n$ of the empirical ranking risk $L_n(r)$ over $\mathcal{R}$ also minimizes
the empirical excess risk $\Lambda_n(r)$. To study this minimizer, consider the
Hoeffding decomposition of $\Lambda_n(r)$:

\[ \Lambda_n(r) - \Lambda(r) = 2 T_n(r) + W_n(r) , \]

where

\[ T_n(r) = \frac{1}{n} \sum_{i=1}^{n} h_r(X_i, Y_i) \]

is a sum of i.i.d. random variables with

\[ h_r(x, y) = \mathbb{E}\, q_r((x, y), (X', Y')) - \Lambda(r) \]

and

\[ W_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j} \widehat{h}_r((X_i, Y_i), (X_j, Y_j)) \]

is a degenerate U-statistic with symmetric kernel

\[ \widehat{h}_r((x, y), (x', y')) = q_r((x, y), (x', y')) - \Lambda(r) - h_r(x, y) - h_r(x', y') . \]

In the analysis we show that the contribution of the degenerate part $W_n(r)$
of the U-statistic is negligible compared to that of $T_n(r)$. This means that
minimization of $\Lambda_n$ is approximately equivalent to minimizing $T_n(r)$. But since
$T_n(r)$ is an average of i.i.d. random variables, this can be studied by known
techniques worked out for empirical risk minimization.

The main tool for handling the degenerate part is a new general moment
inequality for U-processes that may be of independent interest. This
inequality is presented in Section 6. We mention here that for VC classes one may
use an inequality of Arcones and Giné [4].
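
The Hoeffding decomposition is an algebraic identity and can be checked numerically when the underlying distribution is known. The sketch below does this for a uniform distribution on a finite set of points; it uses the plain ranking loss of a scoring rule as the symmetric kernel rather than the excess-loss kernel of the text, since the decomposition holds for any symmetric kernel. All names and data are hypothetical.

```python
from itertools import product

def hoeffding_parts(kernel, population, sample):
    """Hoeffding decomposition of a degree-2 U-statistic.

    `population` is a finite list of equally likely points z = (x, y) and
    `kernel` is symmetric. Returns (mean, u_stat, t_n, w_n, worst_cond_mean).
    """
    m = len(population)
    # Expectation of the kernel over two independent draws from the population.
    mean = sum(kernel(a, b) for a, b in product(population, repeat=2)) / m ** 2

    def h(z):
        """First-order projection: E kernel(z, Z') - mean."""
        return sum(kernel(z, b) for b in population) / m - mean

    def h_hat(z, z_prime):
        """Degenerate part of the kernel."""
        return kernel(z, z_prime) - mean - h(z) - h(z_prime)

    n = len(sample)
    pairs = [(sample[i], sample[j]) for i in range(n) for j in range(n) if i != j]
    u_stat = sum(kernel(a, b) for a, b in pairs) / len(pairs)
    t_n = sum(h(z) for z in sample) / n
    w_n = sum(h_hat(a, b) for a, b in pairs) / len(pairs)
    # Degeneracy: the conditional mean of h_hat(z, Z') should vanish for every z.
    worst_cond_mean = max(
        abs(sum(h_hat(z, b) for b in population) / m) for z in population
    )
    return mean, u_stat, t_n, w_n, worst_cond_mean

def ranking_loss(z, z_prime):
    """Loss of the rule that ranks by x itself; symmetric in its two arguments."""
    (x, y), (x_prime, y_prime) = z, z_prime
    return 1.0 if (y - y_prime) * (x - x_prime) < 0 else 0.0

population = [(0.1, 1.0), (0.4, 2.0), (0.35, 3.0), (0.8, 4.0), (0.6, 2.5)]
sample = [population[0], population[2], population[1], population[4]]

mean, u_stat, t_n, w_n, worst_cond_mean = hoeffding_parts(ranking_loss, population, sample)
```

The identity `u_stat - mean == 2 * t_n + w_n` holds up to rounding for every sample, while the vanishing conditional mean is what makes the second-order term degenerate and hence of smaller order.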

It is well known from the theory of empirical risk minimization (see Tsybakov
[39], Bartlett and Mendelson [6], Koltchinskii [24], Massart [29]), that, in order
to improve the rates of convergence (such as the bound $O(\sqrt{V/n})$ obtained for
VC classes in Section 3), it is necessary to impose some conditions on the joint
distribution of $(X, Y)$. In our case the key assumption takes the following form:

Assumption 4 There exist constants $c > 0$ and $\alpha \in [0, 1]$ such that for all
$r \in \mathcal{R}$,

\[ \mathrm{Var}(h_r(X, Y)) \le c \, \Lambda(r)^{\alpha} . \]

The improved rates of convergence will depend on the value of $\alpha$. We will
see in some examples that this assumption is satisfied for a surprisingly large
family of distributions, guaranteeing improved rates of convergence. For $\alpha = 0$
the assumption is always satisfied and the corresponding performance bound
does not yield any improvement over those of Section 3. However, we will see
that in many natural examples Assumption 4 is satisfied with values of $\alpha$ close
to one, providing significant improvements in the rates of convergence.

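Now we are prepared to state and prove the main result of the paper. In
order to state the result, we need to introduce some quantities related to the
class $\mathcal{R}$. Let $\epsilon_1, \ldots, \epsilon_n$ be i.i.d. Rademacher random
variables independent of the $(X_i, Y_i)$. Let

\[ Z_\epsilon = \sup_{r \in \mathcal{R}} \left| \sum_{i,j} \epsilon_i \epsilon_j \,
   \widehat{h}_r((X_i, Y_i), (X_j, Y_j)) \right| , \]

\[ U_\epsilon = \sup_{r \in \mathcal{R}} \; \sup_{\alpha : \|\alpha\|_2 \le 1} \;
   \sum_{i,j} \epsilon_i \alpha_j \, \widehat{h}_r((X_i, Y_i), (X_j, Y_j)) , \]

\[ M = \sup_{r \in \mathcal{R}, \; k = 1, \ldots, n} \left| \sum_{i=1}^{n} \epsilon_i \,
   \widehat{h}_r((X_i, Y_i), (X_k, Y_k)) \right| . \]

Introduce the "loss function"

\[ \ell(r, (x, y)) = 2 \, \mathbb{E}\, I_{[(y - Y) \cdot r(x, X) < 0]} - L(r) \]

and define

\[ \nu_n(r) = \frac{1}{n} \sum_{i=1}^{n} \ell(r, (X_i, Y_i)) - L(r) . \]

(Observe that $\nu_n(r)$ has zero mean.) Also, define the pseudo-distance

\[ d(r, r') = \left( \mathbb{E}\left( \mathbb{E}\left[ I_{[r(X, X') \neq r'(X, X')]} \mid X \right] \right)^2 \right)^{1/2} . \]

Let $\phi : [0, \infty) \to [0, \infty)$ be a nondecreasing function such that
$\phi(x)/x$ is nonincreasing and $\phi(1) > 1$ such that for all $r \in \mathcal{R}$,

\[ \sqrt{n} \, \mathbb{E} \sup_{r' \in \mathcal{R}, \; d(r, r') \le \sigma} |\nu_n(r) - \nu_n(r')| \le \phi(\sigma) . \]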

Theorem 5 Consider a minimizer $r_n$ of the empirical ranking risk $L_n(r)$
over a class $\mathcal{R}$ of ranking rules and assume Assumption 4. Then there
exists a universal constant $C$ such that, with probability at least $1 - \delta$, the
ranking risk of $r_n$ satisfies

\[ L(r_n) - L^* \le 2 \left( \inf_{r \in \mathcal{R}} L(r) - L^* \right)
   + C \left( \frac{\mathbb{E} Z_\epsilon}{n^2}
   + \frac{\mathbb{E} U_\epsilon \sqrt{\log(1/\delta)}}{n^2}
   + \frac{\mathbb{E} M \log(1/\delta)}{n^2}
   + \frac{\log(1/\delta)}{n}
   + \rho^2 \log(1/\delta) \right) \]

where $\rho > 0$ is the unique solution of the equation

\[ \sqrt{n} \, \rho^2 = \phi(\rho^{\alpha}) . \]

The theorem provides a performance bound in terms of expected values of
certain Rademacher chaoses indexed by $\mathcal{R}$ and local properties of an ordinary
empirical process. These quantities have been thoroughly studied and are well
understood, and may be easily bounded in many interesting cases. Below we
will work out an example when $\mathcal{R}$ is a VC class of indicator functions.

PROOF. We consider the Hoeffding decomposition of the U-statistic $\Lambda_n(r)$ that is
minimized over $r \in \mathcal{R}$. The idea of the proof is to show that the degenerate
part $W_n(r)$ is of a smaller order and becomes negligible compared to the part $T_n(r)$.
Therefore, $r_n$ is an approximate minimizer of $T_n(r)$ which can be handled by
recent results on empirical risk minimization when the empirical risk is defined
as a simple sample average.
Let $A$ be the event on which

\[ \sup_{r \in \mathcal{R}} |W_n(r)| \le \kappa \]

where

\[ \kappa = C \left( \frac{\mathbb{E} Z_\epsilon}{n^2}
   + \frac{\mathbb{E} U_\epsilon \sqrt{\log(1/\delta)}}{n^2}
   + \frac{\mathbb{E} M \log(1/\delta)}{n^2}
   + \frac{\log(1/\delta)}{n} \right) \]

for an appropriate constant $C$. Then by Theorem 11, $\mathbb{P}[A] \ge 1 - \delta/2$. By the
Hoeffding decomposition of the U-statistic $\Lambda_n(r)$ it is clear that, on $A$, $r_n$
is a $\kappa$-minimizer of

\[ \frac{2}{n} \sum_{i=1}^{n} h_r(X_i, Y_i) \]

over $r \in \mathcal{R}$ in the sense that the value of this latter quantity at its
minimum is at most $\kappa$ smaller than at $r_n$.

Define $\tilde{r}_n$ as $r_n$ on $A$ and an arbitrary minimizer of
$(2/n) \sum_{i=1}^{n} h_r(X_i, Y_i)$ on $A^c$. Then clearly, with probability at least
$1 - \delta/2$, $L(r_n) = L(\tilde{r}_n)$ and $\tilde{r}_n$ is a $\kappa$-minimizer of
$(2/n) \sum_{i=1}^{n} h_r(X_i, Y_i)$. But then we may use Theorem 8.3
of Massart [29] to bound the performance of $\tilde{r}_n$ which implies the theorem. ∎

Observe that the only condition for the distribution is that the variance of
$h_r$ can be bounded in terms of $\Lambda(r)$. In Section 5 we present examples in which
Assumption 4 is satisfied with $\alpha > 0$. We will see below that the value of $\alpha$
in this assumption determines the magnitude of the last term which, in turn,
dominates the right-hand side (apart from the approximation error term).

The factor of 2 in front of the approximation error term
$\inf_{r \in \mathcal{R}} L(r) - L^*$ has no special meaning. It can be replaced by any
constant strictly greater than one at the price of increasing the value of the
constant $C$. Notice that in the bound for $L(r_n) - L^*$ derived from Corollary 3, the
approximation error appears with a factor of 1. Thus, the improvement of Theorem 5
is only meaningful if $\inf_{r \in \mathcal{R}} L(r) - L^*$ does not dominate the other
terms in the bound. Ideally, the class $\mathcal{R}$ should be chosen such that the
approximation error and the other terms in the bound are balanced. If this were the
case, the theorem would guarantee faster rates of convergence. Based on the bounds
presented here, one may design penalized empirical minimizers of the ranking risk
that select the class $\mathcal{R}$ from a collection of classes achieving this
objective. We do not give the details here; we just mention that the techniques
presented in Massart [29] and Koltchinskii [24] may be used in a relatively
straightforward manner to derive such "oracle inequalities" for penalized empirical
risk minimization in the present framework.

In order to illustrate Theorem 5, we consider the case when $\mathcal{R}$ is a VC class,
that is, it has a finite VC dimension $V$.

Corollary 6 Consider the minimizer $r_n$ of the empirical ranking risk $L_n(r)$
over a class $\mathcal{R}$ of ranking rules of finite VC dimension $V$ and assume
Assumption 4. Then there exists a universal constant $C$ such that, with
probability at least $1 - \delta$, the ranking risk of $r_n$ satisfies

\[ L(r_n) - L^* \le 2 \left( \inf_{r \in \mathcal{R}} L(r) - L^* \right)
   + C \left( \frac{V \log(n/\delta)}{n} \right)^{1/(2 - \alpha)} . \]

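PROOF. In order to apply Theorem 5, we need suitable upper bounds for
$\mathbb{E} Z_\epsilon$, $\mathbb{E} U_\epsilon$, $\mathbb{E} M$, and $\rho$. To bound
$\mathbb{E} Z_\epsilon$, observe that $Z_\epsilon$ is a Rademacher chaos
indexed by $\mathcal{R}$ for which Propositions 2.2 and 2.6 of Arcones and Giné [3] may
be applied. In particular, by using Haussler's [19] metric entropy bound for VC
classes, it is easy to see that there exists a constant $C$ such that

\[ \mathbb{E} Z_\epsilon \le C n V . \]

Similarly, $\mathbb{E}_\epsilon M$ is just an expected Rademacher average that may be
bounded by $C \sqrt{V n}$ (see, e.g., [8]).
Also, by the Cauchy-Schwarz inequality,

\[ \begin{aligned}
\mathbb{E} U_\epsilon^2
  &\le \mathbb{E} \sup_{r \in \mathcal{R}} \sum_{j} \Big( \sum_{i} \epsilon_i \,
       \widehat{h}_r((X_i, Y_i), (X_j, Y_j)) \Big)^2 \\
  &\le \mathbb{E} \sup_{r \in \mathcal{R}} \sum_{i,j} \widehat{h}_r((X_i, Y_i), (X_j, Y_j))^2
     + \mathbb{E} \sup_{r \in \mathcal{R}} \sum_{j} \sum_{i \neq k} \epsilon_i \epsilon_k \,
       \widehat{h}_r((X_i, Y_i), (X_j, Y_j)) \, \widehat{h}_r((X_j, Y_j), (X_k, Y_k)) \\
  &\le n^2 + \mathbb{E} \sup_{r \in \mathcal{R}} \sum_{j} \sum_{i \neq k} \epsilon_i \epsilon_k \,
       \widehat{h}_r((X_i, Y_i), (X_j, Y_j)) \, \widehat{h}_r((X_j, Y_j), (X_k, Y_k)) .
\end{aligned} \]

Observe that the second term on the right-hand side is a Rademacher chaos of
order 2 that can be handled similarly to $\mathbb{E} Z_\epsilon$. By repeating the same
argument, one obtains

\[ \mathbb{E} U_\epsilon^2 \le n^2 + C V n^2 . \]

Thus,

\[ \mathbb{E}(U_\epsilon) \le \sqrt{\mathbb{E}(U_\epsilon^2)} \le C n V^{1/2} . \]

This shows that the value of $\kappa$ defined in the proof of Theorem 5 is of the order
of $n^{-1}(V + \log(1/\delta))$. The main term in the bound of Theorem 5 is $\rho^2$. By
mimicking the argument of Massart [29, pp. 297-298], we get

\[ \rho^2 \le C \left( \frac{V \log n}{n} \right)^{1/(2 - \alpha)} \]

as desired. ∎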



5 Examples 

5.1 The bipartite ranking problem 

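Next we derive a simple sufficient condition for achieving fast rates of
convergence for the bipartite ranking problem. Recall that here it suffices to
consider ranking rules of the form $r(x, x') = 2 I_{[s(x) > s(x')]} - 1$ where $s$ is a
scoring function. With some abuse of notation we write $h_s$ for $h_r$.

Noise assumption. There exist constants $c > 0$ and $\alpha \in [0, 1]$ such that for
all $x \in \mathcal{X}$,

\[ \mathbb{E}_{X'}\left( |\eta(x) - \eta(X')|^{-\alpha} \right) \le c . \qquad (2) \]

Proposition 7 Under (2), we have, for all $s \in \mathcal{F}$,

\[ \mathrm{Var}(h_s(X, Y)) \le c \, \Lambda(s)^{\alpha} . \]

PROOF.

\[ \begin{aligned}
\mathrm{Var}(h_s(X, Y))
  &\le \mathbb{E}_X \left[ \left( \mathbb{E}_{X'}\left( I_{[(s(X) - s(X'))(\eta(X) - \eta(X')) < 0]} \right) \right)^2 \right] \\
  &\le \mathbb{E}_X \left[ \mathbb{E}_{X'}\left( I_{[(s(X) - s(X'))(\eta(X) - \eta(X')) < 0]} \, |\eta(X) - \eta(X')|^{\alpha} \right)
       \times \mathbb{E}_{X'}\left( |\eta(X) - \eta(X')|^{-\alpha} \right) \right] \\
  &\qquad \text{(by the Cauchy-Schwarz inequality)} \\
  &\le c \left( \mathbb{E}_X \mathbb{E}_{X'}\left( I_{[(s(X) - s(X'))(\eta(X) - \eta(X')) < 0]} \, |\eta(X) - \eta(X')| \right) \right)^{\alpha} \\
  &\qquad \text{(by Jensen's inequality and the noise assumption)} \\
  &= c \, \Lambda(s)^{\alpha} .
\end{aligned} \]

∎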



Condition (2) is satisfied under quite general circumstances. If $\alpha = 0$ then
clearly the condition poses no restriction, but also no improvement is achieved
in the rates of convergence. On the other hand, at the other extreme, when
$\alpha = 1$, the condition is quite restrictive as it excludes $\eta$ from being
differentiable, for example, if $X$ has a uniform distribution over $[0, 1]$. However,
interestingly, for any $\alpha < 1$, it poses quite mild restrictions, as highlighted
in the following example:

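Corollary 8 Consider the bipartite ranking problem and assume that
$\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$ is such that the random variable $\eta(X)$ has
an absolutely continuous distribution on $[0, 1]$ with a density bounded by $B$. Then
for any $\epsilon > 0$,

\[ \forall x \in \mathcal{X}, \quad
   \mathbb{E}_{X'}\left( |\eta(x) - \eta(X')|^{-1 + \epsilon} \right) \le \frac{2B}{\epsilon} \]

and therefore, by Corollary 6 and Proposition 7, there is a constant $C$ such that for
every $\delta, \epsilon \in (0, 1)$, the excess ranking risk of the empirical minimizer
$r_n$ satisfies, with probability at least $1 - \delta$,

\[ L(r_n) - L^* \le 2 \left( \inf_{r \in \mathcal{R}} L(r) - L^* \right)
   + C B \epsilon^{-1} \left( \frac{V \log(n/\delta)}{n} \right)^{1/(1 + \epsilon)} . \]

PROOF. The corollary follows simply by checking that (2) is satisfied for any
$\alpha = 1 - \epsilon < 1$. Denoting the density of $\eta(X)$ by $f$, we have

\[ \begin{aligned}
\mathbb{E}_{X'}\left( |\eta(x) - \eta(X')|^{-\alpha} \right)
  &= \int_0^1 \frac{1}{|\eta(x) - u|^{\alpha}} f(u) \, du \\
  &\le B \int_0^1 \frac{1}{|\eta(x) - u|^{\alpha}} \, du \\
  &= B \, \frac{\eta(x)^{1 - \alpha} + (1 - \eta(x))^{1 - \alpha}}{1 - \alpha}
   \le \frac{2B}{1 - \alpha} .
\end{aligned} \]

∎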



The condition (2) of the corollary requires that the distribution of $\eta(X)$ is
sufficiently spread out; for example, it cannot have atoms or infinite peaks in its
density. Under such a condition a rate of convergence of the order of
$n^{-1 + \epsilon}$ is achievable for any $\epsilon > 0$.
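
The noise condition (2) can be made explicit in the simplest case, when $\eta(X)$ is uniformly distributed on $[0, 1]$ (so that the density bound is $B = 1$). The sketch below is an illustration under that assumption: it compares the closed form of $\mathbb{E}|t - U|^{-\alpha}$ for a uniform $U$ with a crude midpoint-rule approximation, and checks that the closed form stays below $2/(1 - \alpha)$. The function names are hypothetical.

```python
def neg_moment_closed_form(t, alpha):
    """E|t - U|^(-alpha) for U uniform on [0, 1], with 0 <= alpha < 1 and 0 <= t <= 1."""
    return (t ** (1 - alpha) + (1 - t) ** (1 - alpha)) / (1 - alpha)

def neg_moment_midpoint(t, alpha, cells=100_000):
    """Midpoint-rule approximation of the same integral.

    The integrand has an integrable singularity at u = t, so convergence is
    slow; t should not coincide with a cell midpoint.
    """
    h = 1.0 / cells
    return h * sum(abs(t - (k + 0.5) * h) ** (-alpha) for k in range(cells))

exact = neg_moment_closed_form(0.3, 0.5)
approx = neg_moment_midpoint(0.3, 0.5)

# The closed form is bounded by 2 / (1 - alpha) uniformly in t.
grid = [i / 20 for i in range(21)]
worst_ratio = max(
    neg_moment_closed_form(t, a) * (1 - a) / 2.0
    for t in grid
    for a in (0.0, 0.25, 0.5, 0.75, 0.9, 0.99)
)
```

The bound $2/(1 - \alpha)$ blows up as $\alpha \to 1$, which matches the observation that the condition becomes restrictive only at the extreme $\alpha = 1$.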

Remark 4 Note that we crucially used the reduced variance of the U-statistic
$L_n(r)$ to derive fast rates from the rather weak condition (2). Applying a similar
reasoning for the variance of $q_s((X, Y), (X', Y'))$ (which would be the case if one
considered a risk estimate based on independent pairs by splitting the training
data into two halves, see Section 3), would have led to the condition:

\[ |\eta(x) - \eta(x')| > c , \qquad (3) \]

for some constant $c$, and $x \neq x'$. This condition is satisfied only when $\eta(X)$
has a discrete distribution.



5.2 Noiseless regression model 

Next we consider the noise-free regression model in which $Y = m(X)$ for some
(unknown) function $m : \mathcal{X} \to \mathbb{R}$. Here obviously $L^* = 0$ and the
Bayes ranking rule is given by the scoring function $s^* = m$ (or any strictly
increasing transformation of it). Clearly, in this case

\[ q_r(x, x') = I_{[(m(x) - m(x')) \cdot r(x, x') < 0]} \]

and therefore

\[ \mathrm{Var}(h_r(X, Y)) \le \mathbb{E}\, q_r^2(X, X') = L(r) , \]

so that Assumption 4 is satisfied with $c = 1$ and $\alpha = 1$.
Thus, the risk of the empirical risk minimizer $r_n$ satisfies, with probability at
least $1 - \delta$,

\[ L(r_n) \le 2 \inf_{r \in \mathcal{R}} L(r) + C \, \frac{V \log(n/\delta)}{n} \]

provided $\mathcal{R}$ has finite VC dimension $V$.

5.3 Regression model with noise

Now we turn to the general regression model with heteroscedastic errors
in which $Y = m(X) + \sigma(X) \epsilon$ for some (unknown) functions
$m : \mathcal{X} \to \mathbb{R}$ and $\sigma : \mathcal{X} \to \mathbb{R}$, where $\epsilon$
is a standard gaussian random variable, independent of $X$. We set

\[ \Delta(X, X') = \frac{m(X) - m(X')}{\sqrt{\sigma^2(X) + \sigma^2(X')}} . \]

We have again $s^* = m$ (or any strictly increasing transformation of it) and
the optimal risk is

\[ L^* = \mathbb{E}\, \Phi\left( -|\Delta(X, X')| \right) \]

where $\Phi$ is the distribution function of the standard gaussian random variable.
The maximal value of $L^*$ is attained when the regression function $m(x)$ is
constant. Furthermore, we have

\[ L(s) - L^* = \mathbb{E}\left( |2 \Phi(\Delta(X, X')) - 1| \cdot I_{[(m(X) - m(X')) \cdot (s(X) - s(X')) < 0]} \right) . \]
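
The quantities of this model are simple to evaluate for a given pair of observations. The sketch below assumes that, conditionally on the two observations, the difference of the labels is gaussian with mean $m(x) - m(x')$ and variance $\sigma^2(x) + \sigma^2(x')$ (independent noise variables for the two observations), which gives the standardized difference used in the code. The function names are hypothetical.

```python
import math

def std_normal_cdf(t):
    """Distribution function of the standard gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def delta(m_x, m_xp, sigma_x, sigma_xp):
    """Standardized difference (m(x) - m(x')) / sqrt(sigma(x)^2 + sigma(x')^2)."""
    return (m_x - m_xp) / math.sqrt(sigma_x ** 2 + sigma_xp ** 2)

def pairwise_bayes_error(d):
    """Phi(-|Delta|): conditional probability that the optimal rule misranks the pair."""
    return std_normal_cdf(-abs(d))

def excess_weight(d):
    """|2*Phi(Delta) - 1|: the weight a misranked pair contributes to L(s) - L*."""
    return abs(2.0 * std_normal_cdf(d) - 1.0)

# Hypothetical pair: regression values 1 and 0, noise standard deviations 3 and 4.
d = delta(1.0, 0.0, 3.0, 4.0)
error = pairwise_bayes_error(d)
weight = excess_weight(d)
```

When the regression values coincide the standardized difference is zero, the pairwise Bayes error is $1/2$ and the excess weight vanishes, so such pairs cannot be ranked better than by a coin flip and cost nothing when misranked.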

Noise assumption. There exist constants $c > 0$ and $\alpha \in [0, 1]$ such that for
all $x \in \mathcal{X}$,

\[ \mathbb{E}_{X'}\left( |\Delta(x, X')|^{-\alpha} \right) \le c . \qquad (4) \]

Proposition 9 Under (4), we have, for all $s \in \mathcal{F}$,

\[ \mathrm{Var}(h_s(X, Y)) \le (2 \Phi(c) - 1) \, \Lambda(s)^{\alpha} . \]

PROOF. By symmetry, we have

\[ |2 \Phi(\Delta(X, X')) - 1| = 2 \Phi(|\Delta(X, X')|) - 1 . \]

Then, using the concavity of the distribution function $\Phi$ on $\mathbb{R}_+$, we have,
by Jensen's inequality,

\[ \forall x \in \mathcal{X}, \quad
   \mathbb{E}_{X'} \Phi\left( |\Delta(x, X')|^{-\alpha} \right)
   \le \Phi\left( \mathbb{E}_{X'} |\Delta(x, X')|^{-\alpha} \right) \le \Phi(c) , \]

where we have used (4) together with the fact that $\Phi$ is increasing. Now the
result follows by the argument given in the proof of Proposition 7. ∎

The preceding noise condition is fulfilled in many cases, as illustrated by the
example below.

Corollary 10 Suppose that $m(X)$ has a bounded density and the conditional
variance $\sigma^2(x)$ is bounded over $\mathcal{X}$. Then the noise condition (4) is
satisfied for any $\alpha < 1$.

Remark 5 The argument above still holds if we drop the gaussian noise
assumption. Indeed we only need the random variable $\epsilon$ to have a symmetric
density decreasing over $\mathbb{R}_+$.

6 A moment inequality for U-processes


In this section we establish a general exponential inequality for U-processes.
This result is based on moment inequalities obtained for empirical processes
and Rademacher chaoses in Bousquet, Boucheron, Lugosi, and Massart [9] and
generalizes an inequality due to Arcones and Giné [4]. We also refer to the
corresponding results obtained for U-statistics by Adamczak [1], Giné, Latała,
and Zinn [16], and Houdré and Reynaud-Bouret [22].

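Theorem 11 Let $X, X_1, \ldots, X_n$ be i.i.d. random variables and let $\mathcal{F}$ be a
class of kernels. Consider a degenerate U-process $Z$ of order 2 indexed by
$\mathcal{F}$,

$$Z = \sup_{f\in\mathcal{F}} \left|\sum_{i,j} f(X_i,X_j)\right|,$$

where $E f(X,x) = 0$, $\forall x, f$. Assume also $f(x,x) = 0$, $\forall x$, and
$\sup_{f\in\mathcal{F}} \|f\|_\infty = F$. Let $\varepsilon_1,\ldots,\varepsilon_n$ be i.i.d. Rademacher random
variables and introduce the random variables

$$Z_\varepsilon = \sup_{f\in\mathcal{F}} \left|\sum_{i,j} \varepsilon_i\varepsilon_j f(X_i,X_j)\right|,$$

$$U_\varepsilon = \sup_{f\in\mathcal{F}}\ \sup_{\alpha:\|\alpha\|_2\le 1} \sum_{i,j} \varepsilon_i\alpha_j f(X_i,X_j),$$

$$M = \sup_{f\in\mathcal{F},\,k=1,\ldots,n} \left|\sum_{i=1}^n \varepsilon_i f(X_i,X_k)\right|.$$

Then there exists a universal constant $C > 0$ such that for all $n$ and $q \ge 2$,

$$\left(E Z^q\right)^{1/q} \le C\left(E Z_\varepsilon + q^{1/2} E U_\varepsilon + q(EM + Fn) + q^{3/2} F n^{1/2} + q^2 F\right).$$

Also, there exists a universal constant $C$ such that for all $n$ and $t > 0$,

$$P\{Z > C\,E Z_\varepsilon + t\} \le \exp\left(-\frac{1}{C}\min\left(\left(\frac{t}{EU_\varepsilon}\right)^2, \frac{t}{EM + Fn}, \left(\frac{t}{F\sqrt{n}}\right)^{2/3}, \sqrt{\frac{t}{F}}\right)\right).$$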
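For a finite class of kernels and one fixed draw of the signs, the three
random variables $Z_\varepsilon$, $U_\varepsilon$ and $M$ appearing in Theorem 11 can be evaluated
directly. The sketch below is only an illustration: the kernels
$f_\theta(x,y) = \sin(\theta\pi x)\sin(\theta\pi y)$ with $X$ uniform on $(-1,1)$ are a
hypothetical choice (not taken from the paper), and diagonal terms are skipped
to mimic the assumption $f(x,x) = 0$. It uses the fact that the supremum over
the unit ball in $U_\varepsilon$ is a Euclidean norm.

```python
import math
import random


def chaos_quantities(kernels, xs, eps):
    """Return (Z_eps, U_eps, M) for a finite kernel class and fixed signs eps."""
    n = len(xs)
    z_eps = u_eps = m = 0.0
    for f in kernels:
        # v[j] = sum_{i != j} eps_i f(X_i, X_j); the diagonal is skipped (f(x, x) = 0).
        v = [sum(eps[i] * f(xs[i], xs[j]) for i in range(n) if i != j)
             for j in range(n)]
        # Z_eps: absolute value of the full Rademacher chaos.
        z_eps = max(z_eps, abs(sum(eps[j] * v[j] for j in range(n))))
        # U_eps: sup over ||alpha||_2 <= 1 of sum_j alpha_j v[j] is the l2 norm of v.
        u_eps = max(u_eps, math.sqrt(sum(t * t for t in v)))
        # M: largest absolute coordinate of v.
        m = max(m, max(abs(t) for t in v))
    return z_eps, u_eps, m


def make_kernel(theta):
    # Hypothetical degenerate kernel: E sin(theta*pi*X) = 0 for X uniform on (-1, 1).
    return lambda x, y: math.sin(theta * math.pi * x) * math.sin(theta * math.pi * y)


rng = random.Random(0)
n = 30
xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
eps = [rng.choice((-1, 1)) for _ in range(n)]
kernels = [make_kernel(theta) for theta in (1, 2, 3)]
z_eps, u_eps, m = chaos_quantities(kernels, xs, eps)
```

For every draw one has $M \le U_\varepsilon \le \sqrt{n}\,M$ and $Z_\varepsilon \le \sqrt{n}\,U_\varepsilon$, which gives a
quick sanity check of an implementation.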



Remark 6 A generously overestimated value of the constants may be easily 

deduced from the proof. We are convinced that these are far from being the 
best possible but do not have a good guess of what the best constants might 
be. 



PROOF. The proof of Theorem 11 is based on symmetrization, decoupling, and
concentration inequalities for empirical processes and Rademacher chaos.

Since the $f$ are degenerate kernels, one may relate the moments of $Z$ to those
of $Z_\varepsilon$ by the randomization inequality

$$E Z^q \le 4^q\, E Z_\varepsilon^q,$$

valid for $q \ge 1$, see Chapter 3 of [12]. Thus, it suffices to derive moment
inequalities for the symmetrized U-process $Z_\varepsilon$. We do this by conditioning.
Denote by $E_\varepsilon$ the expectation taken with respect to the variables $\varepsilon_i$ (i.e.,
conditional expectation given $X_1,\ldots,X_n$). Then we write $E Z_\varepsilon^q = E E_\varepsilon Z_\varepsilon^q$ and
study the quantity $E_\varepsilon Z_\varepsilon^q$, with the $X_i$ fixed. But then $Z_\varepsilon$ is a so-called
Rademacher chaos whose tail behavior has been studied, see Talagrand [38],
Ledoux [26], Boucheron, Bousquet, Lugosi, and Massart [9]. In particular, for
any $q \ge 2$,

$$\left(E_\varepsilon Z_\varepsilon^q\right)^{1/q} \le E_\varepsilon Z_\varepsilon + \left(E_\varepsilon (Z_\varepsilon - E_\varepsilon Z_\varepsilon)_+^q\right)^{1/q} \quad \text{(since } Z_\varepsilon \ge 0\text{)}$$

$$\le E_\varepsilon Z_\varepsilon + 3\sqrt{q}\, E_\varepsilon U_\varepsilon + 4qB$$

with $U_\varepsilon$ defined above and

$$B = \sup_{f\in\mathcal{F}}\ \sup_{\alpha,\alpha':\|\alpha\|_2,\|\alpha'\|_2\le 1} \left|\sum_{i,j} \alpha_i\alpha'_j f(X_i,X_j)\right|,$$

where the second inequality follows by Theorem 14 of [9]. Using the inequality
$(a+b+c)^q \le 3^{q-1}(a^q + b^q + c^q)$, valid for $q \ge 2$ and $a,b,c > 0$, we have

$$E_\varepsilon Z_\varepsilon^q \le 3^{q-1}\left((E_\varepsilon Z_\varepsilon)^q + 3^q q^{q/2}(E_\varepsilon U_\varepsilon)^q + 4^q q^q B^q\right).$$

It remains to derive suitable upper bounds for the expectation of the three terms 
on the right-hand side. 

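First term: $E(E_\varepsilon Z_\varepsilon)^q$

In order to handle the moments of $E_\varepsilon Z_\varepsilon$, first we note that by a decoupling
inequality in de la Peña and Giné [12, page 101],

$$E_\varepsilon Z_\varepsilon \le 8\, E_\varepsilon Z'_\varepsilon,$$

where

$$Z'_\varepsilon = \sup_{f\in\mathcal{F}} \left|\sum_{i,j} \varepsilon_i\varepsilon'_j f(X_i,X_j)\right|.$$

Here $\varepsilon'_1,\ldots,\varepsilon'_n$ are i.i.d. Rademacher variables, independent of the $X_i$ and the
$\varepsilon_i$. Note that $E_\varepsilon$ now denotes expectation taken with respect to both the $\varepsilon_i$
and the $\varepsilon'_i$.

Thus, we have

$$E(E_\varepsilon Z_\varepsilon)^q \le 8^q\, E(E_\varepsilon Z'_\varepsilon)^q.$$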
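In order to bound the moments of the random variable $A = E_\varepsilon Z'_\varepsilon$, we apply
Corollary 3 of [9]. In order to apply this corollary, define, for $k = 1,\ldots,n$, the
random variables

$$A_k = E_\varepsilon \sup_{f\in\mathcal{F}} \left|\sum_{i,j\ne k} \varepsilon_i\varepsilon'_j f(X_i,X_j)\right|.$$

It is easy to see that $A_k \le A$. On the other hand, defining

$$R_k = \sup_{f\in\mathcal{F}} \left|\sum_{i=1}^n \varepsilon_i f(X_i,X_k)\right|,$$

we clearly have

$$A - A_k \le 2 E_\varepsilon R_k.$$

Also, denoting by $f^*$ the (random) function achieving the maximum in the
definition of $Z'_\varepsilon$, we have

$$\sum_{k=1}^n (A - A_k) \le E_\varepsilon\left(\sum_{k=1}^n \varepsilon_k \sum_{j=1}^n \varepsilon'_j f^*(X_k,X_j) + \sum_{k=1}^n \varepsilon'_k \sum_{i=1}^n \varepsilon_i f^*(X_i,X_k)\right) = 2A.$$

Therefore,

$$\sum_{k=1}^n (A - A_k)^2 \le 4A\, E_\varepsilon M,$$

where $M = \max_k R_k$. Then by Corollary 3 of [9], we obtain

$$E(E_\varepsilon Z'_\varepsilon)^q = E A^q \le 2^{q-1}\left(2^q (E Z'_\varepsilon)^q + 5^q q^q E(E_\varepsilon M)^q\right).$$

By un-decoupling (see de la Peña and Giné [12, page 101]), we have
$E Z'_\varepsilon \le 4\, E Z_\varepsilon$.

To bound $E(E_\varepsilon M)^q$, observe that $E_\varepsilon M$ is a conditional Rademacher average,
for which Theorem 13 of [9] may be applied. According to this,

$$E(E_\varepsilon M)^q \le 2^{q-1}\left(2^q (EM)^q + 5^q q^q F^q\right).$$

Collecting terms, we have

$$E(E_\varepsilon Z_\varepsilon)^q \le 128^q (E Z_\varepsilon)^q + 320^q q^q (EM)^q + 800^q F^q q^{2q}.$$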



Second term: $E_X(E_\varepsilon U_\varepsilon)^q$

The moments of $E_\varepsilon U_\varepsilon$ can be estimated by the same inequality as the one we
used for $E_\varepsilon M$ since $E_\varepsilon U_\varepsilon$ is also a conditional Rademacher average. Observing
that

$$\sup_{f,i}\ \sup_{\alpha:\|\alpha\|_2\le 1} \sum_j \alpha_j f(X_i,X_j) \le F\sqrt{n}$$

by the Cauchy-Schwarz inequality, we have, by Theorem 13 from [9],

$$E(E_\varepsilon U_\varepsilon)^q \le 2^{q-1}\left(2^q (EU_\varepsilon)^q + 5^q q^q F^q n^{q/2}\right).$$

Third term: $E_X B^q$

Finally, by the Cauchy-Schwarz inequality, we have $B \le nF$, so

$$E_X B^q \le n^q F^q.$$

Now it remains to simply put the pieces together to obtain

$$E Z^q \le 12^q\left(128^q (E Z_\varepsilon)^q + 12^q q^{q/2} (EU_\varepsilon)^q + 320^q q^q (EM)^q + 4^q F^q n^q q^q + 30^q F^q n^{q/2} q^{3q/2} + 800^q F^q q^{2q}\right),$$

proving the announced moment inequality.

In order to derive the exponential inequality, use Markov's inequality
$P\{Z > t\} \le t^{-q} E Z^q$ and choose

$$q = C \min\left(\left(\frac{t}{EU_\varepsilon}\right)^2, \frac{t}{EM + Fn}, \left(\frac{t}{F\sqrt{n}}\right)^{2/3}, \sqrt{\frac{t}{F}}\right)$$

for an appropriate constant $C$. □

7 Convex risk minimization 

Several successful algorithms for classification, including various versions of 
boosting and support vector machines are based on replacing the loss func- 
tion by a convex function and minimizing the corresponding empirical convex 
risk functional over a certain class of functions (typically over a ball in an ap- 
propriately chosen Hilbert or Banach space of functions). This approach has 
important computational advantages, as the minimization of the empirical con- 
vex functional is often computationally feasible by gradient descent algorithms. 
Recently, significant theoretical advances have been made in understanding the
statistical behavior of such methods, see, e.g., Bartlett, Jordan, and McAuliffe
[5], Blanchard, Lugosi and Vayatis [7], Breiman [10], Jiang [23], Lugosi and
Vayatis [28], Zhang [41].
The purpose of this section is to extend the principle of convex risk
minimization to the ranking problem studied in this paper. Our analysis also provides
a theoretical framework for the analysis of some successful ranking algorithms
such as the RankBoost algorithm of Freund, Iyer, Schapire, and Singer [14].
In what follows we adapt the arguments of Lugosi and Vayatis [28] (where a
simple binary classification problem was considered) to the ranking problem.

The basic idea is to consider ranking rules induced by real-valued functions,
that is, ranking rules of the form

$$r(x,x') = \begin{cases} 1 & \text{if } f(x,x') > 0 \\ -1 & \text{otherwise} \end{cases}$$

where $f : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ is some measurable real-valued function. With a slight
abuse of notation, we will denote by $L(f) = P\{\mathrm{sgn}(Z)\cdot f(X,X') < 0\} = L(r)$ the
risk of the ranking rule induced by $f$. (Here $\mathrm{sgn}(x) = 1$ if $x > 0$, $\mathrm{sgn}(x) = -1$
if $x < 0$, and $\mathrm{sgn}(x) = 0$ if $x = 0$.) Let $\phi : \mathbb{R} \to [0,\infty)$ be a convex cost
function satisfying $\phi(0) = 1$ and $\phi(x) \ge I_{[x\ge 0]}$. Typical choices of $\phi$ include the
exponential cost function $\phi(x) = e^x$, the "logit" function $\phi(x) = \log_2(1 + e^x)$,
or the "hinge loss" $\phi(x) = (1+x)_+$. Define the cost functional associated to
the cost function $\phi$ by

$$A(f) = E\,\phi\left(-\mathrm{sgn}(Z)\cdot f(X,X')\right).$$

Obviously, $L(f) \le A(f)$. We denote by $A^* = \inf_f A(f)$ the "optimal" value of
the cost functional, where the infimum is taken over all measurable functions
$f : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$.

The most natural estimate of the cost functional $A(f)$, based on the training
data $D_n$, is the empirical cost functional defined by the U-statistic

$$A_n(f) = \frac{1}{n(n-1)} \sum_{i\ne j} \phi\left(-\mathrm{sgn}(Z_{i,j})\cdot f(X_i,X_j)\right).$$
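The empirical cost functional is straightforward to compute by a double loop
over ordered pairs. The following is a minimal sketch, assuming (as an
illustration) that $Z_{i,j} = (Y_i - Y_j)/2$, so that only the sign of $Y_i - Y_j$
matters; the function names and the toy data are hypothetical. It also
evaluates the empirical ranking risk, which the convex cost dominates pair by
pair.

```python
import math


def sgn(x):
    return (x > 0) - (x < 0)


# The three convex cost functions mentioned in the text.
COSTS = {
    "exponential": lambda x: math.exp(x),
    "logit": lambda x: math.log2(1.0 + math.exp(x)),
    "hinge": lambda x: max(0.0, 1.0 + x),
}


def empirical_cost(f, xs, ys, cost):
    """U-statistic A_n(f): average of cost(-sgn(Z_ij) f(X_i, X_j)) over i != j."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                z = (ys[i] - ys[j]) / 2.0  # assumed form of Z_ij
                total += cost(-sgn(z) * f(xs[i], xs[j]))
    return total / (n * (n - 1))


def empirical_ranking_risk(f, xs, ys):
    """Fraction of ordered pairs with sgn(Z_ij) f(X_i, X_j) < 0."""
    n = len(xs)
    errors = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and sgn((ys[i] - ys[j]) / 2.0) * f(xs[i], xs[j]) < 0
    )
    return errors / (n * (n - 1))


# Toy data and a hypothetical scoring-type function f(x, x') = x - x'.
xs = [0.1, 0.4, 0.35, 0.8, 0.9]
ys = [0.0, 1.0, 0.5, 2.0, 1.5]
f = lambda x, xp: x - xp
risks = {name: empirical_cost(f, xs, ys, c) for name, c in COSTS.items()}
```

Since each cost satisfies $\phi(0) = 1$, the constant function $f \equiv 0$ has
empirical cost exactly 1 under all three choices.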

The ranking rules based on convex risk minimization we consider in this
section minimize, over a set $\mathcal{F}$ of real-valued functions $f : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$, the empirical
cost functional $A_n$, that is, we choose $f_n = \arg\min_{f\in\mathcal{F}} A_n(f)$ and assign the
corresponding ranking rule

$$r_n(x,x') = \begin{cases} 1 & \text{if } f_n(x,x') > 0 \\ -1 & \text{otherwise.} \end{cases}$$

(Here we assume implicitly that the minimum exists. More precisely, one may
define $f_n$ as any function $f \in \mathcal{F}$ satisfying $A_n(f_n) \le \inf_{f\in\mathcal{F}} A_n(f) + 1/n$.)

By minimizing convex risk functionals, one hopes to make the excess convex
risk $A(f_n) - A^*$ small. This is meaningful for ranking if one can relate the
excess convex risk to the excess ranking risk $L(f_n) - L^*$. This may be done
quite generally by recalling a result of Bartlett, Jordan, and McAuliffe [5]. To
this end, introduce the functions

$$H(\rho) = \inf_{\alpha\in\mathbb{R}} \left(\rho\,\phi(-\alpha) + (1-\rho)\,\phi(\alpha)\right)$$

and

$$H^-(\rho) = \inf_{\alpha:\,\alpha(2\rho-1)\le 0} \left(\rho\,\phi(-\alpha) + (1-\rho)\,\phi(\alpha)\right).$$

Defining $\psi$ over $\mathbb{R}$ by

$$\psi(x) = H^-\left(\frac{1+x}{2}\right) - H\left(\frac{1-x}{2}\right),$$

Theorem 3 of [5] implies that for all functions $f : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$,

$$L(f) - L^* \le \psi^{-1}\left(A(f) - A^*\right),$$

where $\psi^{-1}$ denotes the inverse of $\psi$. Bartlett, Jordan, and McAuliffe show
that, whenever $\phi$ is convex, $\lim_{x\to 0}\psi^{-1}(x) = 0$, so convergence of the excess
convex risk to zero implies that the excess ranking risk also converges to zero.
Moreover, in most interesting cases $\psi^{-1}(x)$ may be bounded, for $x > 0$, by
a constant multiple of $\sqrt{x}$ (such as in the case of exponential or logit cost
functions) or even by $x$ (e.g., if $\phi(x) = (1+x)_+$ is the so-called hinge loss).
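The transform $\psi$ can be approximated numerically by minimizing over a grid of
values of $\alpha$. The sketch below assumes the form
$\psi(x) = H^-((1+x)/2) - H((1-x)/2)$ (which coincides with the transform of
Bartlett, Jordan, and McAuliffe because $H$ is symmetric around $1/2$); the grid
and its range are arbitrary illustrative choices. For the hinge loss the grid
computation returns $\psi(x) = |x|$, and for the exponential cost it is close to
$1 - \sqrt{1-x^2}$, which is at least $x^2/2$ and hence consistent with the
$\sqrt{x}$ bound on $\psi^{-1}$.

```python
import math

# Grid of candidate values for alpha; contains -1, 0 and 1 exactly.
GRID = [i / 100.0 for i in range(-500, 501)]


def hinge(x):
    return max(0.0, 1.0 + x)


def exponential(x):
    return math.exp(x)


def H(rho, phi, grid=GRID):
    """Approximate inf over alpha of rho*phi(-alpha) + (1-rho)*phi(alpha)."""
    return min(rho * phi(-a) + (1.0 - rho) * phi(a) for a in grid)


def H_minus(rho, phi, grid=GRID):
    """Same infimum restricted to alpha with alpha*(2*rho - 1) <= 0."""
    return min(
        rho * phi(-a) + (1.0 - rho) * phi(a)
        for a in grid
        if a * (2.0 * rho - 1.0) <= 0
    )


def psi(x, phi, grid=GRID):
    # Assumed form of the transform; see the lead-in paragraph.
    return H_minus((1.0 + x) / 2.0, phi, grid) - H((1.0 - x) / 2.0, phi, grid)


values = {x: (psi(x, hinge), psi(x, exponential)) for x in (0.0, 0.3, 0.6)}
```

The grid approximation degrades when the minimizer of the conditional cost
escapes the grid range, for instance for the exponential cost with $x$ close
to 1.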

Thus, to analyze the excess ranking risk $L(f) - L^*$ for convex risk
minimization, it suffices to bound the excess convex risk. This may be done by
decomposing it into "estimation" and "approximation" errors as follows:

$$A(f_n) - A^* \le \left(A(f_n) - \inf_{f\in\mathcal{F}} A(f)\right) + \left(\inf_{f\in\mathcal{F}} A(f) - A^*\right).$$

Clearly, just like in Section 3, we may (loosely) bound the excess convex risk
over the class $\mathcal{F}$ as

$$A(f_n) - \inf_{f\in\mathcal{F}} A(f) \le 2 \sup_{f\in\mathcal{F}} |A_n(f) - A(f)|.$$

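To bound the right-hand side, assume, for simplicity, that the class $\mathcal{F}$ of
functions is uniformly bounded, say $\sup_{f\in\mathcal{F},\,x,x'} |f(x,x')| \le B$. Then once again, we may
appeal to Lemma 14 (see the Appendix) and the bounded differences inequality
which imply that for any $\lambda > 0$,

$$E\exp\left(\lambda \sup_{f\in\mathcal{F}} |A_n(f) - A(f)|\right) \le E\exp\left(\lambda \sup_{f\in\mathcal{F}} \left|\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} \phi\left(-\mathrm{sgn}(Z_{i,\lfloor n/2\rfloor+i})\cdot f(X_i,X_{\lfloor n/2\rfloor+i})\right) - A(f)\right|\right)$$

$$\le \exp\left(\lambda\, E\sup_{f\in\mathcal{F}} \left|\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} \phi\left(-\mathrm{sgn}(Z_{i,\lfloor n/2\rfloor+i})\cdot f(X_i,X_{\lfloor n/2\rfloor+i})\right) - A(f)\right| + \frac{\lambda^2 B^2}{2n}\right).$$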
Now it suffices to derive an upper bound for the expected supremum appearing
in the exponent. This may be done by standard symmetrization and contraction
inequalities. In fact, by mimicking Koltchinskii and Panchenko [25] (see also
the proof of Lemma 2 in Lugosi and Vayatis [28]), we obtain

$$E\sup_{f\in\mathcal{F}} \left|\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} \phi\left(-\mathrm{sgn}(Z_{i,\lfloor n/2\rfloor+i})\cdot f(X_i,X_{\lfloor n/2\rfloor+i})\right) - A(f)\right| \le 4B\phi'(B)\, E\sup_{f\in\mathcal{F}} \left(\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} \sigma_i\cdot f(X_i,X_{\lfloor n/2\rfloor+i})\right),$$

where $\sigma_1,\ldots,\sigma_{\lfloor n/2\rfloor}$ are i.i.d. Rademacher random variables independent of $D_n$,
that is, symmetric sign variables with $P\{\sigma_i = 1\} = P\{\sigma_i = -1\} = 1/2$.
We summarize our findings:

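Proposition 12 Let $f_n$ be the ranking rule minimizing the empirical convex
risk functional $A_n(f)$ over a class $\mathcal{F}$ of functions $f$ uniformly bounded by $-B$
and $B$. Then, with probability at least $1-\delta$,

$$A(f_n) - \inf_{f\in\mathcal{F}} A(f) \le 8B\phi'(B)\,R_n(\mathcal{F}) + \sqrt{\frac{2B^2\log(1/\delta)}{n}},$$

where $R_n$ denotes the Rademacher average

$$R_n(\mathcal{F}) = E\sup_{f\in\mathcal{F}} \left(\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} \sigma_i\cdot f(X_i,X_{\lfloor n/2\rfloor+i})\right).$$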
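To get a feeling for the size of the two terms in Proposition 12, one may
estimate the Rademacher average by Monte Carlo for a small finite class. The
sketch below is illustrative only: it averages over sign draws for one fixed
sample (a conditional version of $R_n$, not the full expectation over the data),
and the class, the sample, and the helper names are hypothetical.

```python
import math
import random


def rademacher_average(functions, xs, n_draws=2000, seed=0):
    """Monte Carlo estimate of the Rademacher average, conditional on xs."""
    rng = random.Random(seed)
    half = len(xs) // 2
    # Pair X_i with X_{floor(n/2)+i}, as in the definition of R_n.
    values = [[f(xs[i], xs[half + i]) for i in range(half)] for f in functions]
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(half)]
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in values) / half
    return total / n_draws


def excess_convex_risk_bound(B, phi_prime_B, rad_avg, n, delta):
    """Right-hand side of Proposition 12 for given B, phi'(B), R_n, n and delta."""
    return 8.0 * B * phi_prime_B * rad_avg + math.sqrt(
        2.0 * B ** 2 * math.log(1.0 / delta) / n
    )


# Hypothetical symmetric class {f, -f} with f(x, x') = x - x', bounded by B = 1 on [0, 1].
rng = random.Random(42)
xs = [rng.random() for _ in range(200)]
functions = [lambda x, xp: x - xp, lambda x, xp: xp - x]
est = rademacher_average(functions, xs)
# Hinge loss: phi'(B) = 1.
bound = excess_convex_risk_bound(1.0, 1.0, est, len(xs), 0.05)
```

With a class this small the Rademacher term is of order $1/\sqrt{n}$, so both terms
of the bound decay at the same rate.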



Many interesting bounds are available for the Rademacher average of various
classes of functions. For example, in analogy with boosting-type classification
problems, one may consider a class $\mathcal{F}_B$ of functions defined by

$$\mathcal{F}_B = \left\{ f(x,x') = \sum_{j=1}^N w_j g_j(x,x') : N \in \mathbb{N},\ \sum_{j=1}^N |w_j| = B,\ g_j \in \mathcal{R} \right\},$$

where $\mathcal{R}$ is a class of ranking rules as defined in Section 3. In this case it is easy
to see that

$$R_n(\mathcal{F}_B) \le B\,R_n(\mathcal{R}) \le \mathrm{const.}\,\frac{B\sqrt{V}}{\sqrt{n}},$$

where $V$ is the VC dimension of the "base" class $\mathcal{R}$.

Summarizing, we have shown that for a ranking rule based on the empirical
minimization of $A_n(f)$ over a class of ranking functions $\mathcal{F}_B$ of the form defined
above, the excess ranking risk satisfies, with probability at least $1-\delta$,

$$L(f_n) - L^* \le \psi^{-1}\left(\mathrm{const.}\,B^2\phi'(B)\sqrt{\frac{V}{n}} + \sqrt{\frac{2B^2\log(1/\delta)}{n}} + \left(\inf_{f\in\mathcal{F}_B} A(f) - A^*\right)\right).$$

This inequality may be used to derive the universal consistency of such ranking
rules. For example, the following corollary is immediate.

Corollary 13 Let $\mathcal{R}$ be a class of ranking rules of finite VC dimension $V$
such that the associated class of functions $\mathcal{F}_B$ is rich in the sense that

$$\lim_{B\to\infty}\ \inf_{f\in\mathcal{F}_B} A(f) = A^*$$

for all distributions of $(X,Y)$. Then if $f_n$ is defined as the empirical
minimizer of $A_n(f)$ over $\mathcal{F}_{B_n}$, where the sequence $B_n$ satisfies $B_n \to \infty$ and
$B_n^2\phi'(B_n)/\sqrt{n} \to 0$, then

$$\lim_{n\to\infty} L(f_n) = L^* \quad \text{almost surely.}$$

Classes $\mathcal{R}$ satisfying the conditions of the corollary exist; we refer the reader
to Lugosi and Vayatis [28] for several examples.

Proposition 12 can also be used for establishing performance bounds for 
kernel methods such as support vector machines. A prototypical kernel-based 
ranking method may be defined as follows. To lighten notation, we write
$\mathcal{W} = \mathcal{X}\times\mathcal{X}$.

Let $k : \mathcal{W}\times\mathcal{W} \to \mathbb{R}$ be a symmetric positive definite function, that is,

$$\sum_{i,j=1}^n \alpha_i\alpha_j k(w_i,w_j) \ge 0,$$

for all choices of $n$, $\alpha_1,\ldots,\alpha_n \in \mathbb{R}$ and $w_1,\ldots,w_n \in \mathcal{W}$.

A kernel-type ranking algorithm may be defined as one that performs
minimization of the empirical convex risk $A_n(f)$ (typically based on the hinge loss
$\phi(x) = (1+x)_+$) over the class of functions defined by a ball of the associated
reproducing kernel Hilbert space of the form (where $w = (x,x')$)

$$\mathcal{F}_B = \left\{ f(w) = \sum_{j=1}^N c_j k(w_j,w) : N \in \mathbb{N},\ \sum_{i,j=1}^N c_i c_j k(w_i,w_j) \le B^2,\ w_1,\ldots,w_N \in \mathcal{W} \right\}.$$

In this case we have

$$R_n(\mathcal{F}_B) \le \frac{2B}{n}\, E\sqrt{\sum_{i=1}^{\lfloor n/2\rfloor} k\left((X_i,X_{\lfloor n/2\rfloor+i}),(X_i,X_{\lfloor n/2\rfloor+i})\right)},$$

see, for example, Boucheron, Bousquet, and Lugosi [8]. Once again, universal
consistency of such kernel-based ranking rules may be derived in a
straightforward way if the approximation error $\inf_{f\in\mathcal{F}_B} A(f) - A^*$ can be guaranteed to go
to zero as $B \to \infty$. For the approximation properties of such kernel classes we
refer the reader to Cucker and Smale [11], Scovel and Steinwart [32], Smale and
Zhou [34], Steinwart [35], etc.
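The kernel bound on the Rademacher average only involves the diagonal values
of the kernel at the paired observations. A minimal sketch, assuming the bound
has the form $(2B/n)\,E\sqrt{\sum_i k(W_i,W_i)}$ with $W_i = (X_i, X_{\lfloor n/2\rfloor+i})$, and
evaluating the quantity inside the expectation on a single sample; the
Gaussian kernel on pairs and all names are illustrative assumptions.

```python
import math


def gaussian_pair_kernel(w, w_prime, bandwidth=1.0):
    """Hypothetical Gaussian kernel on pairs w = (x, x') of real numbers."""
    sq_dist = (w[0] - w_prime[0]) ** 2 + (w[1] - w_prime[1]) ** 2
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_rademacher_bound(kernel, xs, B):
    """Plug-in value of (2B/n) * sqrt(sum_i k(W_i, W_i)) for one sample xs."""
    n = len(xs)
    half = n // 2
    pairs = [(xs[i], xs[half + i]) for i in range(half)]
    trace = sum(kernel(w, w) for w in pairs)
    return 2.0 * B / n * math.sqrt(trace)


xs = [0.2, 1.5, -0.3, 0.9, 2.1, 0.0, -1.2, 0.4, 1.1, 0.7]
value = kernel_rademacher_bound(gaussian_pair_kernel, xs, B=3.0)
```

For any kernel with $k(w,w) = 1$, such as the Gaussian kernel, the plug-in value
reduces to $(2B/n)\sqrt{\lfloor n/2\rfloor}$, which is of order $B/\sqrt{n}$.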
Appendix 1: Basic facts about U-statistics

Here we recall some basic facts about U-statistics. Consider the i.i.d. random
variables $X, X_1, \ldots, X_n$ and denote by

$$U_n = \frac{1}{n(n-1)} \sum_{i\ne j} q(X_i,X_j)$$

a U-statistic of order 2 where the kernel $q$ is a symmetric real-valued function.

U-statistics have been studied in depth and their behavior is well understood.
One of the classical inequalities concerning U-statistics is due to Hoeffding [21]
which implies that, for all $t > 0$,

$$P\{|U_n - EU_n| > t\} \le 2e^{-2\lfloor n/2\rfloor t^2} \le 2e^{-(n-1)t^2}.$$

Hoeffding also shows that, if $\sigma^2 = \mathrm{Var}(q(X_1,X_2))$, then

$$P\{|U_n - EU_n| > t\} \le 2\exp\left(-\frac{\lfloor n/2\rfloor t^2}{2\sigma^2 + 2t/3}\right). \qquad (5)$$

It is important to note here that the latter inequality may be improved
by replacing $\sigma^2$ by a smaller term. This is based on the so-called Hoeffding's
decomposition as described below.

The U-statistic $U_n$ is said to be degenerate if its kernel $q$ satisfies

$$\forall x, \quad E(q(x,X)) = 0.$$

There are two basic representations of U-statistics which we recall next (see 
Serfling [33] for more details). 

Average of 'sums-of-i.i.d.' blocks 

This representation is the key for obtaining 'first-order' results for
non-degenerate U-statistics. The U-statistic $U_n$ can be expressed as

$$U_n = \frac{1}{n!}\sum_{\pi} \frac{1}{\lfloor n/2\rfloor} \sum_{i=1}^{\lfloor n/2\rfloor} q\left(X_{\pi(i)},X_{\pi(\lfloor n/2\rfloor+i)}\right),$$

where the sum is taken over all permutations $\pi$ of $\{1,\ldots,n\}$. The idea underlying
this representation is to reduce the analysis to the case of sums of i.i.d. random
variables. The next simple lemma is based on this representation.
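The permutation representation can be checked by brute force on a small
sample. The sketch below compares the direct definition of $U_n$ with the
average over all $n!$ permutations of the block averages; the kernel
$q(x,y) = |x-y|$ and the data are arbitrary illustrative choices, and the
enumeration is only feasible for very small $n$.

```python
import itertools
import math


def u_statistic(q, xs):
    """Direct definition: average of q(X_i, X_j) over ordered pairs i != j."""
    n = len(xs)
    total = sum(q(xs[i], xs[j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def permutation_average(q, xs):
    """Average over all permutations of the 'sum of i.i.d. blocks' statistic."""
    n = len(xs)
    half = n // 2
    total = 0.0
    for perm in itertools.permutations(range(n)):
        block = sum(q(xs[perm[i]], xs[perm[half + i]]) for i in range(half))
        total += block / half
    return total / math.factorial(n)


q = lambda x, y: abs(x - y)
xs = [0.3, -1.2, 2.0, 0.7, 1.5]
direct = u_statistic(q, xs)
averaged = permutation_average(q, xs)
```

Each ordered pair of distinct indices appears in a given block position for
exactly $(n-2)!$ permutations, which is why the two quantities agree.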

Lemma 14 Let $q_\tau : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ be real-valued functions indexed by $\tau \in T$
where $T$ is some set. If $X_1,\ldots,X_n$ are i.i.d. then for any convex
nondecreasing function $\psi$,

$$E\,\psi\left(\sup_{\tau\in T} \frac{1}{n(n-1)} \sum_{i\ne j} q_\tau(X_i,X_j)\right) \le E\,\psi\left(\sup_{\tau\in T} \frac{1}{\lfloor n/2\rfloor} \sum_{i=1}^{\lfloor n/2\rfloor} q_\tau(X_i,X_{\lfloor n/2\rfloor+i})\right),$$

assuming the suprema are measurable and the expected values exist.



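PROOF. The proof uses the same trick that Hoeffding's above-mentioned
inequalities are based on. Observe that

$$E\,\psi\left(\sup_{\tau\in T} \frac{1}{n(n-1)} \sum_{i\ne j} q_\tau(X_i,X_j)\right) = E\,\psi\left(\sup_{\tau\in T} \frac{1}{n!}\sum_{\pi} \frac{1}{\lfloor n/2\rfloor} \sum_{i=1}^{\lfloor n/2\rfloor} q_\tau\left(X_{\pi(i)},X_{\pi(\lfloor n/2\rfloor+i)}\right)\right)$$

$$\le E\,\psi\left(\frac{1}{n!}\sum_{\pi} \sup_{\tau\in T} \frac{1}{\lfloor n/2\rfloor} \sum_{i=1}^{\lfloor n/2\rfloor} q_\tau\left(X_{\pi(i)},X_{\pi(\lfloor n/2\rfloor+i)}\right)\right) \quad \text{(since } \psi \text{ is nondecreasing)}$$

$$\le \frac{1}{n!}\sum_{\pi} E\,\psi\left(\sup_{\tau\in T} \frac{1}{\lfloor n/2\rfloor} \sum_{i=1}^{\lfloor n/2\rfloor} q_\tau\left(X_{\pi(i)},X_{\pi(\lfloor n/2\rfloor+i)}\right)\right) \quad \text{(by Jensen's inequality)}$$

$$= E\,\psi\left(\sup_{\tau\in T} \frac{1}{\lfloor n/2\rfloor} \sum_{i=1}^{\lfloor n/2\rfloor} q_\tau(X_i,X_{\lfloor n/2\rfloor+i})\right),$$

as desired. □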



Hoeffding's decomposition 

Another way to interpret a U-statistic is as an orthogonal expansion known
as Hoeffding's decomposition.

Assuming that $q(X_1,X_2)$ is square integrable, $U_n - EU_n$ may be decomposed
as a sum $T_n$ of i.i.d. random variables plus a degenerate U-statistic $W_n$. In
order to write this decomposition, consider the following function of one variable

$$h(X_1) = E(q(X_1,X) \mid X_1) - EU_n,$$

and the function of two variables

$$\tilde{h}(X_1,X_2) = q(X_1,X_2) - EU_n - h(X_1) - h(X_2).$$

Then we have the orthogonal expansion

$$U_n = EU_n + 2T_n + W_n,$$

where

$$T_n = \frac{1}{n}\sum_{i=1}^n h(X_i), \qquad W_n = \frac{1}{n(n-1)}\sum_{i\ne j} \tilde{h}(X_i,X_j).$$

$W_n$ is a degenerate U-statistic because its kernel $\tilde{h}$ satisfies

$$E(\tilde{h}(X_1,X) \mid X_1) = 0.$$

Clearly, the variance of $T_n$ is

$$\mathrm{Var}(T_n) = \frac{\mathrm{Var}(E(q(X_1,X) \mid X_1))}{n}.$$

Note that $\mathrm{Var}(E(q(X_1,X) \mid X_1))$ is less than $\mathrm{Var}(q(X_1,X))$ (unless $q$ is already
degenerate). Furthermore, the variance of the degenerate U-statistic $W_n$ is of
the order $1/n^2$. $T_n$ is thus the leading term in this orthogonal decomposition.

Indeed, the limit distribution of $\sqrt{n}(U_n - EU_n)$ is the normal distribution
$\mathcal{N}(0, 4\mathrm{Var}(E(q(X_1,X) \mid X_1)))$ (see [20]). This suggests that inequality (5) may
be quite loose.
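Hoeffding's decomposition is an exact algebraic identity, which makes it easy
to verify numerically when the distribution of $X$ has finite support, so that
all conditional expectations can be computed exactly. The following sketch
assumes, purely for illustration, that $X$ is uniform on $\{0,1,2,3\}$ and that
$q(x,y) = |x-y|$; the names `h` and `h_tilde` mirror the two functions of the
decomposition.

```python
import random

SUPPORT = [0, 1, 2, 3]  # X is uniform on this finite set (illustrative choice)


def q(x, y):
    return abs(x - y)


# EU_n = E q(X_1, X_2) for independent X_1, X_2 uniform on SUPPORT.
THETA = sum(q(x, y) for x in SUPPORT for y in SUPPORT) / len(SUPPORT) ** 2


def h(x):
    """h(x) = E(q(x, X)) - EU_n."""
    return sum(q(x, y) for y in SUPPORT) / len(SUPPORT) - THETA


def h_tilde(x, y):
    """Degenerate kernel: q(x, y) - EU_n - h(x) - h(y)."""
    return q(x, y) - THETA - h(x) - h(y)


def decomposition(xs):
    """Return (U_n, T_n, W_n) for the sample xs."""
    n = len(xs)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    u_n = sum(q(xs[i], xs[j]) for i, j in pairs) / len(pairs)
    t_n = sum(h(x) for x in xs) / n
    w_n = sum(h_tilde(xs[i], xs[j]) for i, j in pairs) / len(pairs)
    return u_n, t_n, w_n


rng = random.Random(1)
xs = [rng.choice(SUPPORT) for _ in range(12)]
u_n, t_n, w_n = decomposition(xs)
```

The identity $U_n = EU_n + 2T_n + W_n$ holds for every sample, and the degeneracy of
the kernel of $W_n$ can be checked by averaging `h_tilde(x, y)` over `y` for each
fixed `x`.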

Indeed, exploiting further Hoeffding's decomposition (combined with
arguments related to decoupling, randomization and hypercontractivity of Rademacher
chaos) de la Peña and Giné [12] established a Bernstein-type inequality of
the form (5) but with $\sigma^2$ replaced by the variance of the conditional expectation
(see Theorem 4.1.13 in [12]).

Specialized to our setting with $q(X_i,X_j) = I_{[Z_{i,j}\cdot r(X_i,X_j)<0]}$ the inequality
of de la Peña and Giné states that

$$P\{|L_n(r) - L(r)| > t\} \le 4\exp\left(-\frac{nt^2}{8s^2 + ct}\right),$$

where $s^2 = \mathrm{Var}(P\{Z\cdot r(X,X') < 0 \mid X\})$ is the variance of the conditional
expectation and $c$ is some constant.
Appendix 2: Connection with the ROC curve and
the AUC criterion

In the bipartite ranking problem, the ROC curve (ROC standing for Receiver
Operating Characteristic, see [18]) and the AUC criterion are popular measures
for evaluating the performance of scoring functions in applications.

Let $s : \mathcal{X} \to \mathbb{R}$ be a scoring function. The ROC curve is defined by plotting
the true positive rate

$$\mathrm{TPR}_s(x) = P(s(X) \ge x \mid Y = 1)$$

against the false positive rate

$$\mathrm{FPR}_s(x) = P(s(X) \ge x \mid Y = -1).$$

By a straightforward change of parameter, the ROC curve may be expressed
as the graph of the power of the test defined by $s(X)$ as a function of its level
$\alpha$:

$$\beta_s(\alpha) = \mathrm{TPR}_s(q_{s,\alpha}),$$

where $q_{s,\alpha} = \inf\{x \in (0,1) : \mathrm{FPR}_s(x) \le \alpha\}$.

Observe that if $s(X)$ and $Y$ are independent (i.e., when $\mathrm{TPR}_s = \mathrm{FPR}_s$), the
ROC curve is simply the diagonal segment $\beta_s(\alpha) = \alpha$. This measure of accuracy
induces a partial order on the set of all scoring functions: for any $s_1, s_2$, we say
that $s_1$ is more accurate than $s_2$ if and only if its ROC curve is above the one of
$s_2$ for every level $\alpha$, that is, if and only if $\beta_{s_2}(\alpha) \le \beta_{s_1}(\alpha)$ for all $\alpha \in (0,1)$.

Proposition 15 The regression function $\eta$ induces an optimal ordering on
$\mathcal{X}$ in the sense that its ROC curve is not below that of any other scoring
function $s$:

$$\forall \alpha \in [0,1], \quad \beta_\eta(\alpha) \ge \beta_s(\alpha).$$

PROOF. The result follows from the Neyman-Pearson lemma applied to the test
of the null assumption "$Y = -1$" against the alternative "$Y = 1$" based on
the observation $X$: the test based on the likelihood ratio $\eta(X)/(1 - \eta(X))$ is
uniformly more powerful than any other test based on $X$. □

Remark 7 Note that the ROC curve does not characterize the scoring function.
For any $s$ and any strictly increasing function $h : \mathbb{R} \to \mathbb{R}$, $s$ and $h \circ s$ clearly
yield the same ordering on $\mathcal{X}$: $\beta_s = \beta_{h\circ s}$.
Instead of optimizing the ROC curve over a class of scoring functions, which
is a difficult task, a simple idea is to search for $s$ that maximizes the Area Under
the ROC Curve (known as the AUC criterion):

$$\mathrm{AUC}(s) = \int_0^1 \beta_s(\alpha)\, d\alpha.$$

This theoretical quantity may be easily interpreted in a probabilistic fashion
as shown by the following proposition.

Proposition 16 For any scoring function $s$,

$$\mathrm{AUC}(s) = P\left(s(X) \ge s(X') \mid Y = 1, Y' = -1\right),$$

where $(X,Y)$ and $(X',Y')$ are independent pairs drawn from the binary
classification model.

PROOF. Let $U$ be a uniformly distributed random variable over $(0,1)$,
independent of $(X,Y)$. Denote by $F_s$ the distribution function of $s(X)$ given $Y = -1$.
Then

$$\mathrm{AUC}(s) = \int_0^1 P(s(X) \ge q_{s,\alpha} \mid Y = 1)\, d\alpha$$

$$= E\left(P(s(X) \ge F_s^{-1}(U) \mid Y = 1)\right)$$

$$= P\left(s(X) \ge s(X') \mid Y = 1, Y' = -1\right).$$
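The empirical counterpart of Proposition 16 can be checked on a small data
set: the area under the empirical ROC curve coincides with the fraction of
(positive, negative) pairs that the scoring function orders correctly. The
sketch below assumes there are no ties between positive and negative scores;
the scores are made-up illustrative values.

```python
def auc_pairwise(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs with s(X) >= s(X')."""
    concordant = sum(1 for p in pos_scores for q in neg_scores if p >= q)
    return concordant / (len(pos_scores) * len(neg_scores))


def auc_roc_area(pos_scores, neg_scores):
    """Area under the empirical ROC curve, integrated with the trapezoid rule."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # (FPR, TPR) at a threshold above every score
    for t in thresholds:
        fpr = sum(1 for q in neg_scores if q >= t) / len(neg_scores)
        tpr = sum(1 for p in pos_scores if p >= t) / len(pos_scores)
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores of instances with Y = 1 and Y = -1 (no ties).
pos_scores = [0.9, 0.8, 0.55, 0.4]
neg_scores = [0.7, 0.5, 0.3, 0.2, 0.1]
pairwise = auc_pairwise(pos_scores, neg_scores)
area = auc_roc_area(pos_scores, neg_scores)
```

Here 17 of the 20 pairs are ordered correctly, so both computations give 0.85.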